Abstract



This report surveys the major developments in sequential Prolog implementation during the 
period 1983-1993. In this decade, implementation technology has matured to such a degree 
that Prolog has left the university and become useful in industry. The survey is divided into four
parts. The first part gives an overview of the important technical developments starting with the 
Warren Abstract Machine (WAM). The second part presents the history and the contributions 
of the major software and hardware systems. The third part charts the evolution of Prolog 
performance since Warren's DEC-10 compiler. The fourth part extrapolates current trends 
regarding the evolution of sequential logic languages, their implementation, and their role in 
the marketplace. 



Summary



This report reviews the major developments in the implementation of sequential Prolog
during the years 1983-1993. In this period, implementation technology has matured
considerably. Prolog is no longer used only in universities but has become
useful for industry. The review is divided into four parts. The first gives a summary
of the important technical developments starting from Warren's abstract machine, the WAM
("Warren Abstract Machine"). The second part presents the history and the contributions of the
software and hardware systems. The third part shows the evolution of the performance of Prolog
systems since Warren's DEC-10 compiler. The fourth part extrapolates, from current
trends, the evolution of sequential logic languages, their implementation, and their
economic impact.
The Prolog of the Eighties was without doubt splendid,
but the Prolog of Today also stands a fair chance.
- Dr. D. von Tischtiegel, Ongerijmde Rijmen.



1 Introduction 

This report is a personal view of the progress made in sequential Prolog implementation from 
1983 to 1993, supplemented with learning of the wise [10]. 1983 was a serendipitous year 
in two ways, one important and one personal. In this year David H. D. Warren published his 
seminal technical report [163] on the New Prolog Engine, which was later christened the WAM 
(for Warren Abstract Machine). 1 This year also marks the beginning of my research career in 
logic programming. 

The title reflects my view that the period 1983-1993 represents the "coming of age" of sequential 
Prolog implementation. In 1983, most Prolog programmers (except for a lucky few at Edinburgh 
and elsewhere) were still using interpreters. In 1993 there are many high quality compilers, 
and the fastest of these are approaching or exceeding the speed of imperative languages. Prolog 
has found a stable niche in the marketplace. Commercial systems are of high quality with a full 
set of desirable features and enough large industrial applications exist to prove the usefulness 
of the language [102, 103]. 

1.1 The Influence of the WAM 

The development of the WAM in 1983 marked the beginning of a veritable "gold rush" for 
Prolog developers, all eager for that magical moment when their very own system would be up 
and running. 

David Warren presented the WAM in a memorable talk at U.C. Berkeley in October 1983. 
This talk was full of mystery, and I remember being amazed at how append/3 was compiled 
into WAM instructions. The sense of mystery was enhanced by the strange names of the 
instructions: put, get, unify, variable, value, execute, proceed, try, retry, and trust. 

The WAM is simple on the outside (a small, clean instruction set) and complex on the inside 
(the instructions do complex things). This simultaneously helped and hindered implementation 
technology. Because the WAM is complex on the inside, for a long time many people used 
it "as is" and were content with its level of performance. Because the WAM is simple on the 
outside, it was a perfect environment for extensions. After a few years, people were extending 
the WAM left and right (see Section 2.3). Papers on yet another WAM extension for a new 
logic language were (and are) very common. 

1 The name WAM is due to the logic programming group at Argonne National Laboratory.

The quickest way to get an implementation of a new logic language is to write an interpreter in
Prolog. In the past, the quickest way to get an efficient implementation was usually to extend the
WAM. Nowadays, it is often better to compile the language into an existing implementation. For 
example, the QD-Janus system [39] is a sequential implementation of Janus (a flat committed- 
choice language) on top of SICStus Prolog (see Section 3.1.9). Performance is reasonable 
partly because SICStus provides efficient support for coroutining. 

If the language is sufficiently different from Prolog, then it is better to design a new abstract 
machine. For example, the λProlog language [100] was implemented with MALI [20]. λProlog
generalizes Prolog with predicate and function variables and typed λ-terms, while keeping
the familiar operational and least fixpoint semantics. MALI is a general-purpose memory 
management library that has been optimized for logic programming systems. 

1.2 Organization of the Survey

The survey is divided into four parts. The first part (Section 2) gives an overview from the 
viewpoint of implementation technology. The second part (Section 3) gives an overview from 
the viewpoint of the systems (both software and hardware) that were responsible for particular 
developments. The vantage points of the two parts are complementary, and there is some 
overlap in the developments that are discussed. The third part (Section 4) summarizes the 
evolution of Prolog performance from the perspective of the Warren benchmarks. The fourth 
part (Section 5) extrapolates current implementation trends into the future. Finally, Section 6 
recapitulates the main developments and concludes the survey. 

A large number of Prolog systems have been developed. The subset included in this survey 
covers systems that are popular (e.g., SICStus Prolog), are good examples of a particular class
of systems (e.g., CHIP for constraint languages), or are especially innovative (e.g., Parma). 
They all have implementations on Unix workstations. I have done my best to contact everyone 
who has made a significant contribution. There are Prologs that exist only on other platforms, 
e.g., on PCs (Arity, LPA, Delphia) and on Lisp machines (LMI, Symbolics). There is relatively 
little publicly available information about these systems, and therefore I do not cover them in 
this report. 

2 The Technological View 

This section gives an overview of Prolog implementation technology. Section 2.1 gives a
brief history of the pre-WAM days (before 1983) and presents the main principle of Prolog 
compilation. Section 2.2 presents and justifies the WAM as Warren originally defined it. 
Section 2.3 explores a few of the myriad systems it has engendered. Section 2.4 highlights 
recent developments that break through its performance barrier. Section 2.5 presents some 
promising execution models different from the WAM. 

Prolog systems can be divided into two categories: structure-sharing or structure-copying. 
The idea of structure sharing is due to Boyer and Moore [19]. Structure copying was first 
described by Bruynooghe [21, 22]. The distinction is based on how compound terms are 
represented. In a structure-sharing representation, all compound terms are represented as a
pair of pointers (called a molecule): one pointer to an array containing the values of the term's 
variables, and another pointer to a representation of the term's nonvariable part (the skeleton). 
In a structure-copying representation, all compound terms are represented as record structures 
with one word identifying the main functor followed by an array of words giving its arguments. 
It is faster to create terms in a structure-sharing representation. It is faster to unify terms in 
a structure-copying representation. Memory usage of both techniques is similar in practice. 
Early systems were mostly structure-sharing. Modern systems are mostly structure-copying. 
The latter includes WAM-based systems and all systems discussed in this survey, except when 
explicitly stated otherwise. 
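
The contrast between the two representations can be sketched in a few lines. This is illustrative Python, not code from any system in the survey; the names Skeleton, Molecule, share_term, copy_term, and molecule_arg are assumptions made for the example, and an integer argument slot is assumed to stand for "the i-th variable of the term".

```python
from dataclasses import dataclass

# Hypothetical illustration of the two term representations.
# A skeleton is the fixed, nonvariable part of a source term: a functor name
# and argument slots, where an int i stands for "the i-th variable".

@dataclass(frozen=True)
class Skeleton:
    functor: str
    args: tuple          # each element is an int (variable index) or an atom (str)

@dataclass
class Molecule:
    """Structure sharing: a pair of pointers, nothing is copied."""
    skeleton: Skeleton   # shared by every instance of the source term
    frame: list          # values of this instance's variables

def share_term(skeleton, frame):
    # Creating a term only pairs up two references.
    return Molecule(skeleton, frame)

def copy_term(skeleton, frame):
    # Structure copying: build a fresh record, functor first, then arguments.
    return [skeleton.functor] + [
        frame[a] if isinstance(a, int) else a for a in skeleton.args
    ]

def molecule_arg(m, i):
    # Reading an argument of a molecule needs an extra indirection
    # through the variable frame.
    a = m.skeleton.args[i]
    return m.frame[a] if isinstance(a, int) else a

point = Skeleton("point", (0, 1, "red"))
shared = share_term(point, [3, 4])   # no copying at creation time
copied = copy_term(point, [3, 4])    # one word per functor and argument
```

The sketch mirrors the trade-off stated above: share_term does no copying at creation time, while every later access to a molecule goes through molecule_arg, whereas the copied record can be read directly.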

2.1 Before the Golden Age 

The insight that deduction could be used as computation was developed in the 1960's through 
the work of Cordell Green and others. Attempts to make this insight practical failed until 
the conception of the Prolog language by Alain Colmerauer and Robert Kowalski in the early 
1970's. It is hard to imagine the leap of faith this required back then: to consider a logical 
description of a problem as a program that could be executed efficiently. The early history is 
presented in [32], and interested readers should look there for more detail. 

The work on Prolog was preceded by the Absys system. Absys (from Aberdeen System) 
was designed and implemented at the University of Aberdeen in 1967. This system was an 
implementation of pure Prolog [46]. For reasons that are unclear but that are probably cultural, 
Absys did not become widespread. 

Several systems were developed by Colmerauer's group. The first system was an interpreter 
written in Algol-W by Philippe Roussel in 1972. This interpreter served to give users enough 
programming experience so that a refined second system could be built. The second system 
was a structure-sharing interpreter written in Fortran in 1973 by Gerard Battani, Henri Meloni, 
and Rene Bazzoli, under the supervision of Roussel and Colmerauer. This system's operational 
semantics and its built-ins are essentially the same as in modern Prolog systems, except for the 
setof/3 and bagof/3 built-ins which were introduced by David Warren in 1980 [162] . The system 
had reasonable performance and was very influential in convincing people that programming 
in logic was a viable idea. 

In particular, David Warren from the University of Edinburgh was convinced. He wrote the 
Warplan program during his two month stay in Marseilles in 1974 [30]. Warplan is a general 
problem solver that searches for a plan (a list of actions) that transforms an initial state to a 
goal state. 

2.1.1 The First Compiler: DEC-10 Prolog

Back in Edinburgh and thinking about a dissertation topic, Warren was intrigued by the idea 
of building a compiler for Prolog. An added push for this idea was the fact that the parser for 
the interpreter was written in Prolog itself, and hence was very slow. It took about a second to 
parse each clause and users were beginning to complain.

By 1977 Warren had developed DEC-10 Prolog, the first Prolog compiler [159]. This landmark
system was built with the help of Fernando Pereira and Luis Pereira. 2 It is structure-sharing 
and supports mode declarations. It was competitive in performance to Lisp systems of the 
day and was for many years the highest performance Prolog system. Its syntax and semantics 
became the de facto standard, the "Edinburgh standard". The 1980 version of this system had 
a heap garbage collector and last call optimization (see Section 2.2.4) [160]. It was the first 
system to have either. An attempt to commercialize this system failed because of the demise of 
the DEC-10/20 machines and because of bureaucratic problems with the British government,
which controlled the rights of all software developed with public funds. 

2.1.2 The Simplification Principle

The main principle in compiling Prolog is to simplify each occurrence of one of its basic 
operations (namely, unification and backtracking). This principle underlies every Prolog 
compiler. Compiling Prolog is feasible because this simplification is so often possible. For 
example, unification is often used purely as a parameter passing mechanism. Most such cases 
are easily detected and compiled into efficient code. 

It is remarkable that the simplification principle has continued to hold to the present day. It 
is valid for WAM-based systems, native code systems, and systems that do global analysis. 
In the WAM the simplification is done statically (at compile-time) and locally [79]. The 
simplification can also be done dynamically (with run-time tests) and globally. An example of 
dynamic simplification is clause selection (see Section 2.4.3). Examples of global simplification 
are global analysis (see Sections 2.4.5 and 2.4.6) and the two-stream unification algorithm (see 
Section 2.4.2). The latter compiles the unification of a complete term as a whole, instead of 
compiling each functor separately like the WAM. 

2.1.3 Bridging the Gap Between DEC-10 Prolog and the WAM

An important early system is the C-Prolog interpreter, which was developed at Edinburgh in 
1982 by Fernando Pereira, Luis Damas, and Lawrence Byrd. It is based on EMAS Prolog, a 
system completed in 1980 by Luis Damas. C-Prolog was one of the best interpreters, and is still 
a very usable system. It did much to create a Prolog programming community and to establish 
the Edinburgh standard. It is cheap, robust, portable (it is written in C), and fast enough for 
real programs. 

There were several compiled systems that bridged the gap between the DEC-10 compiler (1977-1980)
and the WAM (1983) [17, 28]. They include Prolog-X and NIP (New Implementation of
Prolog). David Bowen, Lawrence Byrd, William Clocksin, and Fernando Pereira at Edinburgh 
were the main contributors in this work. These systems miss some of the WAM's good 
optimizations: separate choice points and environments, argument passing in registers instead 
of on the stack, and clause selection (indexing). David Warren left Edinburgh for SRI in 1981.
According to Warren, the WAM design was an outcome of his own explorations and was not
influenced by this work.

2 They are not related.

Figure 1: The Correspondence Between Logical and Imperative Concepts

2.2 The Warren Abstract Machine (WAM) 

By 1983 Warren had developed the WAM, a structure-copying execution model for Prolog that 
has become the de facto standard implementation technique [163]. The WAM defines a high- 
level instruction set that maps closely to Prolog source code. This section concisely explains 
the original WAM. In particular, the many optimizations of the WAM are given a uniform 
justification. This section assumes a basic knowledge of how Prolog executes [85, 115, 130] 
and of how imperative languages are compiled [3]. 

For several years, Warren's report was the sole source of information on the WAM, and its terse 
style gave the WAM an aura of inscrutability. Many people learned the WAM by osmosis, 
gradually absorbing its meaning. Nowadays, there are texts that give lucid explanations of the 
WAM and WAM-like systems [4, 85]. 

There are two main approaches to efficient Prolog implementation: emulated code and native 
code. Emulated code compiles to an abstract machine and is interpreted at run-time. Native 
code compiles to the target machine and is executed directly. Native code tends to be faster 
and emulated code tends to be more compact. With care, both approaches are equally portable 
(see Section 5.1). The original WAM is designed with an emulated implementation in mind. 
For example, its unification instructions are more suited to emulated code (see Section 3.1.4). 
The two-stream unification algorithm of Section 2.4.2 is more suited to native code. 
2.2.1 The Relationship of the WAM to Prolog and Imperative Languages

The execution of Prolog is a natural generalization of the execution of imperative languages 
(see Figure 1). It can be summarized as: 

Prolog = imperative language 
+ unification 
+ backtracking 

As in imperative languages, control flow is left to right within a clause. The goals in a 
clause body are called like procedures. A goal corresponds to a predicate. When a goal is 
called, the clauses in the predicate's definition are chosen in textual order from top to bottom. 
Backtracking is chronological, i.e., control goes back to the most recently made choice and 
tries the next clause. Hence, Prolog is a somewhat limited realization of logic programming, 
but in practice its trade-offs are good enough for a logical and efficient programming style to 
be possible [113]. 
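
As a rough illustration of clause order and chronological backtracking, the sketch below mimics append/3 (the predicate of Figure 3) with a Python generator, for the case where only the third argument is known. The function name append_splits is an assumption, and a generator is only an analogy for backtracking, not how a Prolog system is implemented.

```python
def append_splits(zs):
    """Enumerate (Xs, Ys) with Xs ++ Ys == zs, in Prolog's solution order."""
    # Clause 1 is tried first: append([], L, L).
    yield [], zs
    # On backtracking, clause 2 is tried:
    # append([X|L1], L2, [X|L3]) :- append(L1, L2, L3).
    if zs:
        x, l3 = zs[0], zs[1:]
        # Each solution of the body goal gives one solution of the clause;
        # resuming the inner generator corresponds to going back to the
        # most recently made choice.
        for l1, l2 in append_splits(l3):
            yield [x] + l1, l2

solutions = list(append_splits([1, 2]))
```

Asking for the next element of the generator plays the role of forcing backtracking with ";" at a Prolog top level.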

The WAM mirrors Prolog closely, both in how the program executes and in how the program 
is compiled: 

WAM = sequential control (call/return/jump instructions) 
+ unification (get/put/unify instructions) 
+ backtracking (try/retry/trust instructions) 
+ optimizations (to use as little memory as possible) 

The WAM has a stack-based structure, of which a subset is similar to imperative language 
execution models. It has call and return instructions and local frame (environment) management 
instructions. It is extended with instructions to perform unification and backtracking. These 
form the core of the WAM. Around this core, the WAM has added optimizations intended to 
reduce memory usage. 

Prolog as executed by the WAM defines a close mapping between the terminology of logic and 
that of an imperative language (see Figure 1). Predicates correspond to procedures. Procedures 
always have a case statement as the first part of their definition. Clauses correspond to the 
branches of this case statement. Variables are scoped locally to a clause. 3 Goals in a clause 
correspond to calls. Unification corresponds to parameter passing and assignment. Other 
features do not map directly: backtracking, the single-assignment nature, and the modification 
of control flow with the cut operation. Cut is a language feature that increases the determinism 
of a program by removing choice points. 

The WAM is a good intermediate language in the sense that writing a Prolog-to-WAM compiler 
and a WAM emulator are both straightforward tasks. A compiler and emulator can be built 
without a deep understanding of the internals of Prolog or the WAM. 

3 Global variables and self-modifying code are possible with the assert/1 and retract/1 built-ins. These built-ins 
are potentially nonlogical and certainly inefficient, and hence should be infrequent. 
Table 1: The Internal State of the WAM



2.2.2 Data Structures and Memory Organization 

Prolog is a dynamically typed language, i.e., variables may contain objects of any type at run- 
time. Hence, it must be possible to determine the type of an object at run-time by inspection. 4 
In the WAM, terms are represented as tagged words: a word contains a tag field and a value 
field. The tag field contains the type of the term (atom, number, list, or structure). See [52] for 
an exhaustive presentation of alternative tagging schemes. The value field is used for different 
purposes depending on the type: it contains the value of integers, the address of unbound 
variables and compound terms (lists and structures), and it ensures that each atom has a value 
different from all other atoms. Unbound variables are implemented as self-referential pointers,
i.e., they point to themselves. When two variables are unified, one of them is modified to point 
to the other. 5 Therefore it may be necessary to follow a chain of pointers to access a variable's 
value. This is called dereferencing the variable. 
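
Tagged words, self-referential unbound variables, and dereferencing can be sketched as follows. This is a simplified illustration: the tag names, the flat store list, and the helpers new_var, deref, and bind are assumptions, and constants are reached through a pointer here, whereas a real implementation would typically store them directly in the bound cell.

```python
# Hypothetical sketch of WAM-style tagged words stored in a flat memory array.
REF, INT, ATOM = "ref", "int", "atom"

store = []                      # memory: each cell is a (tag, value) pair

def new_var():
    # An unbound variable is a REF cell that points to itself.
    addr = len(store)
    store.append((REF, addr))
    return addr

def deref(addr):
    # Follow the chain of REF pointers until reaching either an unbound
    # variable (a self-reference) or a non-variable cell.
    while True:
        tag, value = store[addr]
        if tag != REF or value == addr:
            return addr
        addr = value

def bind(var_addr, target_addr):
    # Binding overwrites the unbound cell with a pointer to the target.
    store[var_addr] = (REF, target_addr)

x, y, z = new_var(), new_var(), new_var()
bind(deref(x), deref(y))        # X = Y
bind(deref(y), deref(z))        # Y = Z
store.append((INT, 42))
bind(deref(z), len(store) - 1)  # Z = 42
```

After these three bindings, reading the value of X requires following the chain X, Y, Z before the integer cell is reached.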

Table 1 shows how the internal state of the WAM is stored in registers. The purpose of 
most registers is straightforward. The HB register caches the value of H stored in the most 
recent choice point. The S register is used during unification of compound terms (with 
arguments): it points to an argument being unified. All arguments can be accessed one by one 
by successively incrementing S. Some instructions have different behaviors during read and 
write mode unification; the mode flag is used to distinguish between them (see Section 2.2.3). 
In the original WAM, the mode flag is implicit (it is encoded in the program counter). 

The external state (stored in memory) is divided into six logical areas (see Figure 2): two 
stacks for the data objects, one stack (the PDL) to support unification, one stack (the trail) to 
support the interaction of unification and backtracking, one area as code space, and one area 
as a symbol table. 



4 Unless the type can be determined at compile-time. 

5 More precisely, variable-variable unification can be implemented with a Union-Find algorithm [91]. With this
algorithm, unifying n variables requires O(n α(n)) time, where α(n) is the inverse Ackermann function.
Figure 2: The External State of the WAM



• The global stack or heap. This stack holds lists and structures, the compound data 
terms of Prolog. 

• The local stack. This stack holds environments and choice points. Environments (also 
known as local frames or activation records) contain variables local to a clause. Choice 
points encapsulate execution state for backtracking, i.e., they are continuations. A variant 
model, the split-stack, uses separate stacks for environments and choice points. There 
is no significant performance difference between the split-stack and the merged-stack 
models. The merged-stack model uses more memory if choice points are created. 

• The trail. This stack is used to save locations of bound variables that have to be unbound 
on backtracking. Saving the addresses of variables is called trailing, and restoring them 
to being unbound is called detrailing. Not all variables that are bound have to be trailed. 
A variable must only be trailed if it continues to exist on backtracking, i.e., if its location 
on the global or local stack is older than the top of this stack stored in the most recent 
choice point. This is called the trail condition. Performing it is called the trail check. 
• The push-down list (PDL). This stack is used as a scratch-pad during the unification of
nested compound terms. Often the PDL does not exist as a separate stack, e.g., the local 
stack is used instead. 

• The code area. This area holds the compiled code of a program. It is not recovered on 
backtracking. 

• The symbol table. This area is not mentioned in the original article on the WAM. It 
holds various kinds of information about the symbols (atoms and structure names) used 
in the program. It is not recovered on backtracking. It contains the mapping between 
the internal representation of symbols and their print names, information about operator 
declarations, and various system-dependent information related to the state of the system 
and the external world. Because creating a new entry is relatively expensive, symbol 
table memory is most often not recovered on backtracking. It may be garbage collected. 
Systems that manipulate arbitrary numbers of new atoms (e.g., systems with a database 
interface) must have garbage collection. 
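
The trail condition and detrailing described in the list above can be sketched for a single heap (the local stack is left out). The names heap, trail, hb, and the helper functions are assumptions made for this illustration; hb plays the role of the HB register of Table 1, and restoring hb to the previous choice point after backtracking is omitted.

```python
# Hypothetical sketch of the trail check on a single heap.
heap = []          # each cell: ("ref", addr) or ("int", n)
trail = []         # addresses of bindings to undo on backtracking
hb = 0             # heap top saved in the most recent choice point

def new_var():
    heap.append(("ref", len(heap)))
    return len(heap) - 1

def bind(addr, cell):
    # Trail check: record the binding only if the variable is older than
    # the most recent choice point, i.e., it continues to exist on backtracking.
    if addr < hb:
        trail.append(addr)
    heap[addr] = cell

def try_choice_point():
    # Save the stack tops needed to restore the state on failure.
    global hb
    hb = len(heap)
    return (len(heap), len(trail))

def backtrack(saved):
    # Detrailing: reset trailed variables to unbound, then pop the heap.
    heap_top, trail_top = saved
    while len(trail) > trail_top:
        addr = trail.pop()
        heap[addr] = ("ref", addr)
    del heap[heap_top:]

old = new_var()                 # created before the choice point
cp = try_choice_point()
young = new_var()               # created after the choice point
bind(old, ("int", 1))           # trailed
bind(young, ("int", 2))         # not trailed: its cell is discarded anyway
trailed_before = list(trail)
backtrack(cp)
```

Only the older variable is trailed. The younger one lives above the saved heap top, so popping the heap on backtracking removes it without any trail entry.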

It is possible to vary the organization of the memory areas somewhat without changing anything 
substantial about the execution. For example, some systems have a single data area (sometimes 
called the firm heap) that combines the code area and symbol table. 

2.2.3 The Instruction Set 

The WAM instruction set, along with a brief description of what each instruction does, is
summarized in Table 2. Unification of a variable with a data term known at compile-time
is decomposed into instructions to handle the functor and arguments separately (see Figures 3
and 4). There are no unify_list and unify_structure instructions; they are left out
because they can be implemented using the existing instructions. The switch_on_constant and
switch_on_structure instructions fall through if A1 is not in the hash table. The original WAM
report does not talk about the cut operation, which removes all choice points created since
entering the current predicate. Implementations of cut are presented in [4, 85]. A variable
stored in the current environment (pointed to by E) is denoted by Yi. A variable stored in a
register is denoted by Xi or Ai. A register used to pass arguments is denoted by Ai. A register
used only internally to a clause is denoted by Xi. The notation Vi is shorthand for Xi or Yi. The
notation Ri is shorthand for Xi or Ai.

A useful optimization is the variable/value annotation. Instructions annotated with "variable" 
assume that their argument has not yet been initialized, i.e., it is the first occurrence of the 
variable in the clause. In this case, the unification operation is simplified. For example, the 
get_variable X2, A1 instruction unifies X2 with A1. Since X2 has not yet been initialized, the
unification reduces to a move. Instructions annotated with "value" assume that their argument 
has been initialized (i.e., all later occurrences of the variable). In this case, full unification is 
done. 
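
The difference between the two annotations can be sketched as follows. The term encoding (Python tuples for compound terms, a Var class for variables) and the helper names are assumptions made for this illustration; the unifier omits trailing and the occurs check.

```python
# Hypothetical sketch contrasting the "variable" and "value" annotations.
class Var:
    def __init__(self):
        self.ref = None            # None means unbound

def deref(t):
    while isinstance(t, Var) and t.ref is not None:
        t = t.ref
    return t

def unify(a, b):
    # General unification of atoms, Var objects, and tuples (compound terms).
    a, b = deref(a), deref(b)
    if a is b:
        return True
    if isinstance(a, Var):
        a.ref = b
        return True
    if isinstance(b, Var):
        b.ref = a
        return True
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        return all(unify(x, y) for x, y in zip(a, b))
    return a == b

def get_variable(regs, xn, ai):
    # First occurrence: xn holds nothing yet, so unification is just a move.
    regs[xn] = regs[ai]
    return True

def get_value(regs, xn, ai):
    # Later occurrence: xn is initialized, so full unification is needed.
    return unify(regs[xn], regs[ai])

# Head p(X, X) called as p(f(a), f(Y)): the first X is a move,
# the second X requires unifying f(a) with f(Y).
y = Var()
regs = {"A1": ("f", "a"), "A2": ("f", y)}
ok = get_variable(regs, "X3", "A1") and get_value(regs, "X3", "A2")
```

The first occurrence costs a single register move, while the second occurrence runs the general unifier and binds Y to a.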

Loading argument registers (just before a call)

  put_variable Vn, Ri           Create a new variable, put in Vn and Ri.
  put_value Vn, Ri              Move Vn to Ri.
  put_constant C, Ri            Move the constant C to Ri.
  put_nil Ri                    Move the constant nil to Ri.
  put_structure F/N, Ri         Create the functor F/N, put in Ri.
  put_list Ri                   Create a list pointer, put in Ri.

Unifying with registers (head unification)

  get_variable Vn, Ri           Move Ri to Vn.
  get_value Vn, Ri              Unify Vn with Ri.
  get_constant C, Ri            Unify the constant C with Ri.
  get_nil Ri                    Unify the constant nil with Ri.
  get_structure F/N, Ri         Unify the functor F/N with Ri.
  get_list Ri                   Unify a list pointer with Ri.

Unifying with structure arguments (head unification)

  unify_variable Vn             Move next structure argument to Vn.
  unify_value Vn                Unify Vn with next structure argument.
  unify_constant C              Unify the constant C with next structure argument.
  unify_nil                     Unify the constant nil with next structure argument.
  unify_void N                  Skip next N structure arguments.

Managing unsafe variables (an optimization; see Section 2.2.4)

  put_unsafe_value Vn, Ri       Move Vn to Ri and globalize.
  unify_local_value Vn          Unify Vn with next structure argument and globalize.

Procedural control

  call P, N                     Call predicate P, trim environment size to N.
  execute P                     Jump to predicate P.
  proceed                       Return.
  allocate                      Create an environment.
  deallocate                    Remove an environment.

Selecting a clause (conditional branching)

  switch_on_term V, C, L, S     Four-way jump on type of A1.
  switch_on_constant N, T       Hashed jump (size N table at T) on constant in A1.
  switch_on_structure N, T      Hashed jump (size N table at T) on structure in A1.

Backtracking (choice point management)

  try_me_else L                 Create choice point to L, then fall through.
  retry_me_else L               Change retry address to L, then fall through.
  trust_me_else fail            Remove top-most choice point, then fall through.
  try L                         Create choice point, then jump to L.
  retry L                       Change retry address, then jump to L.
  trust L                       Remove top-most choice point, then jump to L.

Table 2: The Complete WAM Instruction Set



  append([], L, L).
  append([X|L1], L2, [X|L3]) :- append(L1, L2, L3).

Figure 3: The Prolog Code for append/3



  append/3: switch_on_term V1, C1, C2, fail    Jump if variable, constant, list, structure.

  V1:       try_me_else V2         Create choice point if A1 is variable.
  C1:       get_nil A1             Unify A1 with nil.
            get_value A2, A3       Unify A2 and A3.
            proceed                Return to caller.

  V2:       trust_me_else fail     Remove choice point.
  C2:       get_list A1            Start unification of list in A1.
            unify_variable X4      Unify head: move head into X4.
            unify_variable A1      Unify tail: move tail into A1.
            get_list A3            Start unification of list in A3.
            unify_value X4         Unify head: unify head with X4.
            unify_variable A3      Unify tail: move tail into A3.
            execute append/3       Jump to beginning (last call optimization).

Figure 4: The WAM Code for append/3



Figures 3 and 4 give the Prolog source code and the compiled WAM code for the predicate
append/3. The mapping between Prolog and WAM instructions is straightforward (see Section 2.2.5).
The switch instruction jumps to the correct clause or set of clauses depending on
the type of the first argument. This implements first-argument selection (indexing). The choice
point (try) instructions link a set of clauses together. The get instructions unify with the head
arguments. The unify instructions unify with the arguments of structures.

The same instruction sequence is used to take apart an existing structure (read mode) or to
build a new structure (write mode). The get instructions set the mode flag, which determines
whether execution proceeds in read mode or write mode. For example, if get_list A1 sees an
unbound variable argument, it sets the flag to write mode. If it sees a list argument, it sets the
flag to read mode. If it sees any other type, it fails, i.e., it backtracks by restoring state from
the most recent choice point. The unify instructions have different behavior in read and write
mode. The get instructions initialize the S register and the unify instructions increment the S
register.

Choice point handling (backtracking) is done by the try instructions. The try_me_else L
instruction creates a choice point, i.e., it saves all the machine registers on the local stack. It is
compiled just before the code for the first clause in a predicate. It causes a jump to label L on
backtracking. The try L instruction is identical to try_me_else L, except that the destinations
are switched: try immediately jumps to L. The retry_me_else L instruction modifies a choice
point that already exists by changing the address that it jumps to on backtracking. It is compiled
with clauses after the first but not including the last. This means that a predicate with n clauses
is compiled with n - 2 retry_me_else instructions. The trust_me_else fail instruction removes
the top-most choice point from the stack. It is compiled with the last clause in a predicate.

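
The read and write modes described above can be sketched with a very small machine model. The Machine class, the cell tags ("ref", "lis", "con"), and the helper take_or_build are assumptions made for this illustration; trailing and failure handling are left out, and only get_list and unify_variable are modeled.

```python
# Hypothetical mini-machine: a heap of tagged cells, an S register, a mode flag.
class Machine:
    def __init__(self):
        self.heap = []           # cells: ("ref", addr), ("lis", addr), ("con", value)
        self.regs = {}
        self.s = 0               # next argument to read (read mode)
        self.mode = None         # "read" or "write"

    def push(self, cell):
        self.heap.append(cell)
        return len(self.heap) - 1

    def deref(self, cell):
        # Stop at a non-variable cell or at a self-referential (unbound) cell.
        while cell[0] == "ref" and self.heap[cell[1]] != cell:
            cell = self.heap[cell[1]]
        return cell

    def get_list(self, ai):
        cell = self.deref(self.regs[ai])
        if cell[0] == "ref":                 # unbound: a new list will be built
            self.heap[cell[1]] = ("lis", len(self.heap))
            self.mode = "write"
        elif cell[0] == "lis":               # existing list: take it apart
            self.s = cell[1]
            self.mode = "read"
        else:
            return False                     # any other type: fail
        return True

    def unify_variable(self, xn):
        if self.mode == "read":
            self.regs[xn] = self.heap[self.s]    # move next argument to xn
            self.s += 1
        else:
            addr = self.push(("ref", len(self.heap)))  # fresh unbound argument
            self.regs[xn] = self.heap[addr]
        return True

def take_or_build(m):
    # The same three instructions serve both modes.
    return m.get_list("A1") and m.unify_variable("X4") and m.unify_variable("X5")

# Read mode: A1 already holds the list [1].
reader = Machine()
head = reader.push(("con", 1))
reader.push(("con", "nil"))
reader.regs["A1"] = ("lis", head)
take_or_build(reader)

# Write mode: A1 is an unbound variable.
writer = Machine()
var = writer.push(("ref", 0))
writer.regs["A1"] = ("ref", var)
take_or_build(writer)
```

In the reader, X4 and X5 receive the head and tail of the existing list. In the writer, the unbound variable is bound to a new list cell whose head and tail are two fresh variables pushed on the heap.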
2.2.4 Optimizations to Minimize Memory Usage

The core of the WAM is straightforward. What makes it subtle are the added optimizations. 
Because of these optimizations the WAM is extremely memory efficient. For programs with 
sufficient backtracking, a garbage collector is not necessary. The optimizations are explained in 
terms of the following classification of memory, from least to most costly to allocate, deallocate, 
and reuse. 

• Registers (arguments, temporary variables). These are available at any time without 
overhead. 

• Short-lived memory (environments on the local stack). This memory is recovered on 
forward execution, backtracking, and garbage collection. 

• Long-lived memory (choice points on the local stack, data terms on the heap). This 
memory is recovered only on backtracking and garbage collection. 

• Permanent memory (the code area and symbol table). This memory is recovered only 
by garbage collection. 

With this classification, the optimizations can be explained as follows. 

• Prefer registers to memory. There are three optimizations in this category. 

- Argument passing. All procedure arguments are passed in registers. This is im- 
portant because Prolog is procedure-call intensive. For example, the most efficient 
way to iterate is through recursion. Backtracking can express iteration as well, but 
less efficiently. 

- The return address. Inside a procedure, the return address of the immediate caller 
is stored in the CP register. This optimization is closely related to the leaf routine 
calling protocol done in imperative language compilers. 

- Temporary variables. Temporary variables are variables whose lifetimes do not 
cross a call. That is, they are not used both before and after a call. Therefore 
they may be kept in registers. This definition of temporary variables simplifies and 
slightly generalizes Warren's original definition. 

• Prefer short-lived memory to long-lived memory. There are three optimizations in 
this category. 

- Permanent variables. Permanent variables are variables that need to survive a 
call. They may not be kept in registers, but must be stored in memory. They are 
given a slot in the environment. This makes it easy to deallocate their memory if 
they are no longer needed after exiting the predicate (see unsafe variables, below). 

- Environment trimming (last call optimization). Environments are stored on 
the local stack and recovered on forward execution just before the last call in a 
procedure body. This optimization is known as the tail recursion optimization or
more accurately, the last call optimization. This is based on the observation that 
an environment's space does not need to exist after the last call, since no further 
computation is done in the environment. The space can be recovered before entering 
the last call instead of after it returns. Because execution will never return to the 
procedure, the last call may be converted into a jump. For recursive predicates, 
this converts recursion into iteration, since the jump is to the first instruction of the 
predicate. The WAM generalizes the last call optimization to be done gradually 
during execution of a clause: the environment size is reduced ("trimmed") after 
each call in the clause body, so that only the information needed for the rest of the 
clause is stored. Trimming increases the amount of memory that is recovered by 
garbage collection. 

- Unsafe variables. A variable whose lifetime crosses a call must be allocated an 
unbound variable cell in memory (i.e., in an environment or on the heap). If it is 
sure that the unbound variable will be bound before exiting the clause, then the 
space for the cell will not be referenced after exiting the clause. In that case the cell 
may be allocated in the environment and recovered with environment trimming. In 
the other case one is not sure that the unbound variable will be bound. This leads 
to the following space-time trade-off. The fastest alternative is to always create 
the variable on the heap. The most memory-efficient alternative is to create the 
variable on the environment and just before trimming the environment, to move the 
variable to the heap if it is unbound. The WAM has chosen the second alternative, 
and the variable being tested is referred to as an "unsafe variable". 

• Prefer long-lived memory to permanent memory. Data objects (variables and com- 
pound terms) disappear on backtracking, and hence all allocated memory for them may 
be put on the heap. In a typical implementation this is not quite true. The symbol table 
and code area are not put on the heap, because their modifications (i.e., newly interned 
atoms and asserted clauses) are permanent. 
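
The distinction between temporary and permanent variables, and the effect of environment trimming, can be made concrete with a small sketch. The Python below is illustrative only. It assumes Warren's chunk-based rule (the head is grouped with the first goal, and a variable is permanent if it occurs in more than one chunk), which approximates the lifetime-based definition given above. The function names and the clause encoding are hypothetical.

```python
def classify_variables(head_vars, body_goals):
    """Split clause variables into (temporary, permanent) sets.

    head_vars: set of variable names in the clause head.
    body_goals: list of sets, the variable names in each body goal.
    Assumption: the head and the first goal form one chunk; every
    later goal is its own chunk.
    """
    chunks = [set(g) for g in body_goals] or [set()]
    chunks[0] = chunks[0] | set(head_vars)
    all_vars = set().union(*chunks)
    permanent = {v for v in all_vars if sum(v in c for c in chunks) > 1}
    return all_vars - permanent, permanent


def live_after_each_call(head_vars, body_goals):
    """For each call, list the permanent variables still needed after it.

    The environment can be trimmed to these variables once the call is made.
    """
    _, permanent = classify_variables(head_vars, body_goals)
    live = []
    for i in range(len(body_goals)):
        later = set().union(*body_goals[i + 1:])
        live.append(sorted(permanent & later))
    return live


# Clause 2 of Figure 5: p(A,B,C) :- q(A,Z,W), r(W,T,B), z(A,X)
head = {"A", "B", "C"}
goals = [{"A", "Z", "W"}, {"W", "T", "B"}, {"A", "X"}]
temporary, permanent = classify_variables(head, goals)
trimming = live_after_each_call(head, goals)
```

In this example the environment needs three slots (A, B, W) across the call to q/3, one slot (A) across the call to r/3, and none at the last call, which is why the last call can be turned into a jump.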



Measurements have been done of the unsafe variable trade-off for Quintus Prolog (see Sec- 
tion 3.1.4) and the VAM (see Section 2.5.1) [76]. Tim Lindholm measured the increase of 
peak heap usage for Quintus on a set of programs including Chat-80 [161] and the Quintus test 
suite and compiler. He found that the first alternative increases peak heap usage by 50 to 100% 
for Quintus (see Section 3.1.4). Because this leads to increased garbage collection and stack 
shifting, Lindholm concluded that unsafe variables are useful. 

Andreas Krall measured the increase of peak heap usage on a series of small and medium-size 
programs for the VAM, which stores all unbound variables on the heap. He measured increases 
of from 4% to 26%, with an average of 15%. Because unsafe variables impose a run-time 
overhead (two comparisons instead of one for the trail check and run-time tests for globalizing 
variables), Krall concluded that unsafe variables are not useful. 

The VAM and Quintus execution models are significantly different, so the VAM and Quintus 
measurements cannot be compared directly. My own view is that unsafe variables are useful 
since the run-time overhead is small and the reduction of heap usage is significant.

2.2.5 How to Compile Prolog to the WAM

Compiling Prolog to the WAM is straightforward because there is a close mapping between 
lexical tokens in the Prolog source code and WAM instructions. Figure 5 gives a scheme 
for compiling Prolog to the WAM. For simplicity, the figure omits register allocation and 
peephole optimization. This compilation scheme generates suboptimal code. One can improve 
it by generating switch instructions to avoid choice point creation in some cases. For more 
information on WAM compilation see [116, 148]. 

The clauses of predicate p/3 are compiled into blocks of code that are linked together with 
try instructions. Each block consists of a sequence of get instructions to do the unification of 
the head arguments, followed by a sequence of put instructions to set up the arguments for 
each goal in the body, and a call instruction to execute the goal. The block is surrounded by 
allocate and deallocate instructions to create an environment for permanent variables. The 
last call is converted into a jump (an execute instruction) because of the last call optimization 
(see Section 2.2.4). 

2.3 WAM Extensions for Other Logic Languages 

Many WAM variants have been developed for new logic languages, new computation models, 
and parallel systems. This section presents three significant examples: 

• The CHIP constraint system, which interfaces the WAM with three constraint solvers. 

• The clp(FD) constraint system, which implements a glass box approach that allows 
constraint solvers to be written at the user level. 

• The SLG-WAM, which extends the WAM with memoization.

2.3.1 CHIP

CHIP (Constraint Handling In Prolog) [2] is a constraint logic language developed at ECRC (see
Section 3.1.7 for more information on ECRC). The system has been commercialized by Cosytec
to solve industrial optimization problems. CHIP is the first compiled constraint language. In 
addition to equality over Prolog terms, CHIP adds three other computation domains: finite 
domains, boolean terms, and linear inequalities over rationals. The CHIP compiler is built on 
top of the SEPIA WAM-based Prolog compiler. The system contains a tight interface between 
the WAM kernel and the constraint solvers. The system extends the WAM to the C-WAM (C 
for Constraint). The C-WAM is quite complex: it has new data structures and over one hundred 
new instructions. Many instructions exist to solve commonly-occurring constraints quickly. 

Original Prolog predicate:

    p(E,F,G) :- k(X,F,P), m(S,T), ...
    p(A,B,C) :- q(A,Z,W), r(W,T,B), z(A,X)
    ...
    p(Q,R,S)

Compiled WAM code (the try, retry, and trust instructions manage the choice point):

    p/3: try_me_else L2
         (code for clause 1)
    L2:  retry_me_else L3
         (code for clause 2)
         ...
    Ln:  trust_me_else fail
         (code of last clause)

A single compiled clause:

    allocate             Create environment.
    (get arguments)      Unify with caller arguments.
    (put arguments)      Load arguments and call.
    call q/3
    (put arguments)      Load arguments and call.
    call r/3
    (put arguments)
    deallocate           Remove environment.
    execute z/2          Last call is a jump.

Figure 5: How to Compile Prolog to the WAM

Measurements of early versions of CHIP showed that a large amount of trailing was being done,
to the point that many programs quickly ran out of memory. This happened because the trailing
was done with the WAM's trail condition (see Section 2.2.2). This condition is appropriate for
equality constraints, which are implemented by unification in the WAM. For more complex
constraints, the condition is wasteful because a variable's value is often modified several times
between two choice points. The CHIP system reduces memory usage by introducing a different
trail condition called "time-stamping" [1]. Each data term is marked with an identifier of the
choice point segment the term belongs to (see Section 2.3.1). Trailing is only necessary if the
current choice point segment is different from the segment stored in the term. Time-stamping
is an essential technique for any practical constraint solver.
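
The time-stamping condition can be sketched as follows. This is a minimal Python model, not the CHIP implementation: the class and method names are hypothetical, and segment identifiers are assumed to be fresh integers handed out each time a choice point is created. A cell is trailed only the first time it is modified within a choice point segment.

```python
class Cell:
    """A modifiable data cell carrying a time stamp."""

    def __init__(self, value, stamp):
        self.value = value
        self.stamp = stamp  # segment in which the cell was created or last trailed


class Store:
    def __init__(self):
        self.trail = []          # entries: (cell, old_value, old_stamp)
        self.choice_points = []  # entries: (trail_height, segment)
        self.segment = 0         # identifier of the current segment
        self._fresh = 1          # next unused segment identifier

    def new_cell(self, value):
        return Cell(value, self.segment)

    def push_choice_point(self):
        self.choice_points.append((len(self.trail), self.segment))
        self.segment = self._fresh
        self._fresh += 1

    def assign(self, cell, value):
        # Time-stamp condition: trail only if the cell has not yet been
        # trailed (or created) in the current segment.
        if cell.stamp != self.segment:
            self.trail.append((cell, cell.value, cell.stamp))
            cell.stamp = self.segment
        cell.value = value

    def backtrack(self):
        height, segment = self.choice_points.pop()
        while len(self.trail) > height:
            cell, old_value, old_stamp = self.trail.pop()
            cell.value, cell.stamp = old_value, old_stamp
        self.segment = segment


store = Store()
x = store.new_cell((1, 10))        # a domain variable with bounds 1..10
store.push_choice_point()
for upper in (9, 7, 5):            # the domain is narrowed three times
    store.assign(x, (1, upper))
entries_after_three_updates = len(store.trail)
store.backtrack()
```

Three modifications between two choice points produce a single trail entry, and backtracking still restores the original value.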

2.3.2 clp(FD) 

The clp(FD) system [29, 40] is a finite domain solver integrated into a WAM emulator. It 
was built by Daniel Diaz and Philippe Codognet at INRIA (Rocquencourt, France). It uses a 
glass box approach. Instead of considering a constraint solver as a black box (in the manner of 
CHIP), a set of primitive operations is added that allows the constraint solver to be programmed 
in the user language. The resulting system outperforms the hard-wired finite domain solver of 
CHIP. 

In clp(FD), a single primitive constraint is added to the system, namely the range constraint X
in R, where X is a domain variable and R is a range (e.g., 1..10). Instead of just using constant
ranges, the idea is to introduce what are known as indexical ranges, i.e., ranges of the form 
f(Y)..g(Y) or h(Y) where f(Y), g(Y), and h(Y) are functions of the domain variable Y. A set 
of these functions that do local propagation is built-in. For example, the system provides the 
constraints X in min(Y)..max(Y) and X in dom(Y) with the obvious meanings. Arithmetic 
constraints such as X=Y+Z and boolean constraints such as X=Y and Z can be written in terms 
of indexical range constraints. 
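
As an illustration of how an arithmetic constraint reduces to indexical ranges, the sketch below propagates X=Y+Z over interval domains. It is a simplified Python model under stated assumptions: domains are plain (min, max) intervals, the decomposition into three range constraints is the usual bounds-propagation scheme rather than a transcription of the clp(FD) library, and a naive fixpoint loop stands in for the system's suspension queues. All names are hypothetical.

```python
def intersect(a, b):
    """Intersect two intervals; an empty result means the constraint fails."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    if lo > hi:
        raise ValueError("empty domain: constraint fails")
    return (lo, hi)


def propagate_sum(dom):
    """Propagate X = Y + Z until no domain changes.

    dom maps 'X', 'Y', 'Z' to (min, max) pairs. Each update below plays
    the role of one indexical range constraint.
    """
    changed = True
    while changed:
        old = dict(dom)
        (yl, yh), (zl, zh) = dom["Y"], dom["Z"]
        # X in min(Y)+min(Z) .. max(Y)+max(Z)
        dom["X"] = intersect(dom["X"], (yl + zl, yh + zh))
        xl, xh = dom["X"]
        # Y in min(X)-max(Z) .. max(X)-min(Z)
        dom["Y"] = intersect(dom["Y"], (xl - zh, xh - zl))
        yl, yh = dom["Y"]
        # Z in min(X)-max(Y) .. max(X)-min(Y)
        dom["Z"] = intersect(dom["Z"], (xl - yh, xh - yl))
        changed = dom != old
    return dom


domains = propagate_sum({"X": (0, 4), "Y": (1, 3), "Z": (2, 4)})
```

Starting from X in 0..4, Y in 1..3, and Z in 2..4, local propagation narrows the domains to X in 3..4, Y in 1..2, and Z in 2..3.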

Indexical range constraints are smoothly integrated into the WAM by providing support for 
domain variables and suspension queues for the various indexical functions [40]. The time- 
stamping technique of CHIP is used to reduce trailing. 

2.3.3 SLG-WAM 

Memoization is a technique that caches already-computed answers to a predicate. By adding 
memoization to Prolog's resolution mechanism, one obtains an execution model that can do 
both top-down and bottom-up execution. For certain algorithms, this model executes simple 
logical definitions with a lower order of complexity than a pure top-down execution would. 
For example, the recursive definition of the Fibonacci function runs in linear time rather than 
exponential time. More realistic examples are parsing and dynamic programming. 
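
The Fibonacci example can be reproduced with a few lines of Python. This sketch only illustrates the effect of caching answers on the number of calls; it is ordinary function memoization, not an implementation of OLDT or SLG resolution, and the function names are hypothetical.

```python
def fib_naive(n, counter):
    """Pure top-down evaluation: every call is recomputed."""
    counter[0] += 1
    if n < 2:
        return n
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)


def fib_tabled(n, table, counter):
    """Top-down evaluation with a table of already-computed answers."""
    counter[0] += 1
    if n in table:                      # answer already computed: reuse it
        return table[n]
    if n < 2:
        result = n
    else:
        result = (fib_tabled(n - 1, table, counter)
                  + fib_tabled(n - 2, table, counter))
    table[n] = result
    return result


naive_calls, tabled_calls = [0], [0]
naive_result = fib_naive(20, naive_calls)
tabled_result = fib_tabled(20, {}, tabled_calls)
```

Both versions compute the same value, but the number of calls in the tabled version grows linearly with n, while the naive version grows exponentially.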

One realization of memoization is OLDT resolution (Ordered Linear resolution of Definite 
clauses with Tabulation) [131]. A recent generalization, SLG resolution [27], handles negation 
as well. This has been implemented in an abstract machine, the SLG-WAM (previously 
called the OLDT-WAM), and realized in the XSB system (see Section 3.1.8). The current
implementation executes Prolog code with less than 10% overhead relative to the WAM as 
implemented in XSB, and is much faster than deductive database systems [132]. An important
source of overhead is the complex trail: it is a linked list whose elements contain the address 
and old contents of a cell. 

2.4 Beyond the WAM: Evolutionary Developments 

The WAM was a large step towards the efficient execution of Prolog. From the viewpoint of 
theorem proving, Prolog is extremely fast. But there is still a large gap between the efficiency 
of the WAM and that of imperative language implementations. As people started using Prolog 
for standard programming tasks, the gap became apparent and people started to optimize their 
systems. This section discusses the gap and some of the clever ideas that have been developed 
to close it. 

2.4.1 Chinks in the Armor

This section lists the limits to Prolog performance and their causes. 

• WAM instructions are too coarse-grained to perform much optimization. For example, 
many WAM instructions perform an implicit dereference operation, even if the compiler 
can determine that such an operation is unnecessary in a particular case. In practice, 
dereference chains are short: dynamic measurements on real programs show that two 
thirds are of length zero (no memory reference is required), one third are of length one, 
and <1% are of length two or greater [145]. Despite these statistics, dereferencing is 
expensive. For example, Aquarius on the VLSI-BAM, a high-performance system with 
hardware support, spends 9% of its total execution time doing dereferencing [152]. 

• The majority of predicates written by human programmers are intended to give at most 
one solution, i.e., they are deterministic. These predicates are in effect case statements, 
yet they are too often compiled in an inefficient manner using the full generality of 
backtracking (which implies saving the machine state and repeated failure and state 
restoration). The WAM's first-argument selection is inadequate to compile these predi- 
cates efficiently (see Section 2.4.3). Measurements of Prolog applications support this 
assertion: 

- Tick shows that shallow backtracking (backtracking from clause to clause within a 
single predicate) dominates even for well-written deterministic programs. Choice 
point references constitute about half (45-60%) of all data references [143]. 

- Touati and Despain show that at least 40% of all choice point and fail operations 
can be removed through optimization [145]. 

• The single-assignment nature of Prolog (i.e., a variable can only be assigned one value 
in forward execution) needs to be handled well. In a straightforward implementation it is 
time-consuming to modify large data structures incrementally, because the programmer 
may use copying of terms to represent incremental changes, and the implementation 
will not optimize this copying away. This problem, also known as the copy avoidance 
problem, is a special case of the general problem of efficiently representing state modification
in logic. It is impossible to use large data structures with the same efficiency as
in procedural languages unless the compiler is able to introduce destructive assignment 
(overwriting of memory locations) in the implementation. Section 5.1 gives suggestions 
on how to get around this problem. 

• Prolog has dynamic typing (variables may contain values of any type) and dynamic 
memory allocation (all data objects are allocated at run-time). Both of these cost 
execution time. They should be compiled statically wherever possible. 

• Programming style has a large effect on a program's efficiency. Prolog programming is 
at a high level of abstraction, so it hides many details of the implementation from the 
programmer, making it difficult to improve efficiency when it is important to do so. For 
example, adding a single cut can make the difference between a program that runs fast 
and one that thrashes. This is possible even if the cut does not change the operational 
semantics of the program. The thrashing behavior is caused by a pile-up of choice 
points during deterministic (forward) execution. Because the choice points encapsulate 
execution states that remain accessible through potential backtracking, their memory is 
not recovered by garbage collection. 

• The apparent need for architectural support. So-called "general-purpose" architectures 
are in fact optimized for imperative languages and number crunching. To run Prolog 
equally well, either the compiler must do more work, or conceivably the architecture 
should be modified. Some experiments have been done with architectures optimized for 
Prolog (among others, the PSI-II, KCM, and VLSI-BAM, see Section 3.2), but the true 
architectural needs of Prolog are a moving target. They depend on the execution model 
and the sophistication of the compiler. As better compilers have been developed, the 
perceived architectural needs of Prolog have been getting smaller and smaller. One need 
likely to stay for a long time is a fast memory system. Prolog's dynamic nature requires 
frequent pointer dereferencing. There are no compilation techniques on the horizon that 
are likely to reduce the resulting need for a fast memory system (see Section 5.1). 
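
To make the cost of the implicit dereference concrete, here is a minimal Python sketch of the operation. The cell encoding (tagged pairs, with an unbound variable represented as a reference to itself) is an assumption chosen for illustration, and the way the chain length is counted here is not necessarily the one used in the cited measurements.

```python
def deref(heap, word):
    """Follow reference links until a non-reference or an unbound variable.

    heap: list of (tag, value) cells. A ('ref', a) word points to cell a;
    an unbound variable is a cell that refers to itself.
    Returns (final_word, number_of_links_followed).
    """
    length = 0
    while word[0] == "ref":
        target = heap[word[1]]
        if target == word:       # self-reference: unbound variable
            break
        word = target
        length += 1
    return word, length


heap = [
    ("int", 42),   # 0: a constant
    ("ref", 0),    # 1: variable bound to cell 0
    ("ref", 1),    # 2: variable bound to the variable in cell 1
    ("ref", 3),    # 3: unbound variable
]
```

When the compiler knows that a word cannot be a reference, the loop and its tag test can be omitted entirely, which a coarse-grained instruction with a built-in dereference does not allow.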

2.4.2 How to Compile Unification: The Two-Stream Algorithm 

This section presents the two-stream unification algorithm, an elegant scheme for compiling 
unification that is more efficient than the WAM for native code implementation. Rough 
measurements comparing unification times of the VLSI-PLM (a microcoded WAM) and the 
VLSI-BAM (see Section 3.2.3) show a speedup factor of two to three [153] in favor of the 
latter. This algorithm was independently reinvented at least four times by different people at
about the same time: Mohamed Amraoui at the Université de Rennes I [8], André Mariën
and Bart Demoen at BIM and KUL [86, 88], Kent Boortz at SICS [16], and Micha Meier at
ECRC [94]. Write mode propagation was discussed earlier by Andrew Turk [146].

Figures 6 and 8 show how the unification X=f(g(A),h(B)) is compiled in the WAM and by the 
two-stream algorithm. The actions of the instructions are represented as primitive constraints 
of two kinds: functor constraints (such as X=(f/2)) and argument constraints (such as X.1=Y). 

    WAM instructions           Operations
                               (as primitive constraints)

    get_structure f/2, A1      X=(f/2)
    unify_variable X4          X.1=Y
    unify_variable X5          X.2=Z
    get_structure g/1, X4      Y=(g/1)
    unify_value A2             Y.1=A
    get_structure h/1, X5      Z=(h/1)
    unify_value A3             Z.1=B

    Variable    Register

    X           A1
    A           A2
    B           A3
    Y           X4
    Z           X5

Figure 6: The WAM Compilation of the Unification X=f(g(A),h(B))



Functor and argument constraints correspond to the get and unify instructions in the WAM. 
An important advantage of the primitive constraint representation over the WAM is that the 
constraints may be executed in any order. In addition to providing a powerful conceptual 
description of the WAM, primitive constraints are useful in compiling more advanced logic 
languages [6, 84, 117]. 

The WAM compiles unification as a single sequence of instructions (see Figure 6). This has 
several problems: 

• Write mode is not propagated to subterms. For example, the unification X=f(g(a)) 
is compiled as X=f(T), T=g(a). These two unifications are compiled independently. If 
X is unbound, the fact that T is created as an unbound variable in the first unification 
is not propagated to the second unification. This means a superfluous dereference, a 
superfluous trail check, and a superfluous binding. 

• Instructions have modes. All instructions have two modes of execution, read mode and 
write mode. The current mode is stored in a global mode flag, which is set in get_list and
get_structure instructions and tested in all unify instructions. Some implementations 
(e.g., the intended implementation of the original WAM report, and Quintus) encode the 
mode flag in the program counter, which avoids the testing overhead. 

• Poor translation to native code. The straightforward method for generating native code 
is to macro-expand the WAM instructions. This means that the read and write mode parts 
are interleaved, which results in many jumps. This is less of a problem on a microcoded 
machine since microcode jumps are often free (the destination address is part of the 
microword). 

The key insight is that unification should be compiled into two instruction streams, one for read
mode and one for write mode, with conditional jumps between them in both directions. With
two streams one avoids superfluous operations while keeping a linear code size. The practical
problems that remain are how to configure the instructions so they work correctly despite being
jumped to from different places, and how to minimize bookkeeping overhead for the jumps.

    R stream                    W stream

                                set S←0
                                (sequence of instructions)
    set S←1
    jump L      ---------->  L:
                                (selectively executed subsequence)
    L':         <----------     if S=1 jump L'
                                (sequence of instructions, continued)

Figure 7: How to Execute a Particular Subsequence with Low Overhead

Figure 7 illustrates a technique to execute any subsequence of a main instruction sequence with 
very little overhead. The idea is to give the main sequence an identifier (say the integer 0) and 
the subsequence a different identifier (say the integer 1). Then a single conditional jump is all 
that is required. If the subsequence is non-contiguous, then a single conditional jump to the next 
segment is needed per contiguous segment of the subsequence. If more than one subsequence 
has to be selected, then a unique identifier is needed for each one. The subsequences may be 
overlapping. 
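
The selective execution technique can be simulated with a tiny interpreter. The Python below is a sketch under assumptions: the instruction encoding, the label names, and the `run` function are all hypothetical, and a `halt` instruction stands in for the return point L' in the other stream.

```python
def run(program, labels, entry, s):
    """Execute from label `entry` with selector S = s; return the ops executed.

    program: list of ('op', name), ('jump_if', value, label), or ('halt',).
    A 'jump_if' leaves the sequence only when S equals its value.
    """
    executed = []
    pc = labels[entry]
    while True:
        instr = program[pc]
        if instr[0] == "halt":
            return executed
        if instr[0] == "jump_if" and s == instr[1]:
            pc = labels[instr[2]]
            continue
        if instr[0] == "op":
            executed.append(instr[1])
        pc += 1


program = [
    ("op", "a"),               # start of the main sequence (S=0)
    ("op", "b"),               # L: start of the subsequence
    ("op", "c"),
    ("jump_if", 1, "exit"),    # end of the subsequence: leave if S=1
    ("op", "d"),
    ("halt",),                 # exit
]
labels = {"main": 0, "L": 1, "exit": 5}

whole_sequence = run(program, labels, "main", 0)
subsequence_only = run(program, labels, "L", 1)
```

Entering at the top with S=0 executes every instruction, while entering at L with S=1 executes only the subsequence. The only added cost is one conditional jump at the end of the subsequence.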

With the idea of selective execution in mind, arrange the primitive constraints of the term 
according to a depth-first traversal of the term (Figure 8). The resulting sequence satisfies the 
property that each subterm corresponds to a contiguous sequence of instructions. This is all 
one needs to implement the algorithm. At run-time, unification follows the read mode stream, 
and selectively executes contiguous parts of the write mode stream for subterms to be created. 

    ...
    if S=1 jump Ly'
    Z=(h/1)
    Z.1=B
    if S=1 jump Lz'
    if S=0 jump Lx'

Figure 8: The Two-Stream Compilation of the Unification X=f(g(A),h(B))

A reduction of bookkeeping overhead is possible based on a second property of the sequence.
Nested terms correspond to nested sequences of instructions. Number each subterm with
an integer representing its nesting level within the term. With this numbering, an adjacent
sequence of conditional jumps back to the read mode stream can be collapsed into a single
conditional jump (changing the condition from "=" to "≥"). In Figure 8, the two conditional
jumps if S=1 jump Lz' and if S=0 jump Lx' can be rewritten as the single jump if S≥0 jump
Lz'. To collapse the maximum number of jumps, reorder the arguments of all subterms to unify
the most complex subterms last.

The advantages of the two-stream algorithm are: 



• Low overhead. The bookkeeping overhead is a small constant factor. The only book- 
keeping is the set of jumps and register moves needed to manage the selective execution 
of subsequences. This is small compared to the work done in the primitive constraints. 
There is no explicit mode flag. 

• Downward propagation of write mode. The write mode of a term is propagated at 
compile-time to all of its subterms. There are no superfluous dereferences, trail checks, 
or bindings. 

• Upward propagation of read mode. The read mode of a term is propagated at compile- 
time to its siblings and ancestors. 

• Linear code size. This contrasts with the algorithm of [150], which expands all cases 
without any sharing. That algorithm has zero bookkeeping overhead, but exponential 
code sizes occur in practice. 

• Efficient expansion to native code. The number of instructions generated is about double
that of the WAM, but the instructions themselves have less than half the complexity.
The primitive constraints of Figure 8 are expanded differently in the read mode and write 
mode streams. Essentially, the internal operations of the WAM instructions have been 
made visible and arranged in an efficient order. There are no jumps inside the primitive 
constraints, but only between them, and then only when it is necessary to choose between 
read and write mode. 
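
The propagation of modes can be illustrated with a small interpretive model. The Python below is a sketch, not the compiled two-stream code: it uses two mutually recursive functions in place of the two linear instruction streams, it omits trailing, and all class and function names are hypothetical. It shows that once an unbound variable is found, the whole subterm is built without further dereferencing, and that unification returns to read mode for the siblings.

```python
class Var:
    def __init__(self):
        self.ref = None              # None means unbound


class Struct:
    def __init__(self, name, args=()):
        self.name, self.args = name, list(args)


derefs = [0]                         # counts dereference operations


def deref(t):
    derefs[0] += 1
    while isinstance(t, Var) and t.ref is not None:
        t = t.ref
    return t


def write(pattern):
    """Write stream: build the term with no dereference or binding test."""
    if isinstance(pattern, Var):
        return pattern
    name, subpatterns = pattern
    return Struct(name, [write(p) for p in subpatterns])


def read(pattern, term):
    """Read stream: match an existing term, switching to the write stream
    for any subterm whose position holds an unbound variable."""
    if isinstance(pattern, Var):     # first occurrence of a pattern variable
        pattern.ref = term
        return True
    term = deref(term)
    if isinstance(term, Var):        # unbound: write mode for this subterm
        term.ref = write(pattern)
        return True
    name, subpatterns = pattern
    if term.name != name or len(term.args) != len(subpatterns):
        return False
    return all(read(p, a) for p, a in zip(subpatterns, term.args))


# X = f(g(A), h(B)) with X unbound: one dereference, then pure write mode.
A, B, X = Var(), Var(), Var()
ok_unbound = read(("f", [("g", [A]), ("h", [B])]), X)
derefs_unbound = derefs[0]

# X2 = f(Y, h(c)) with Y unbound: read mode for f and h, write mode for g.
A2, B2, Y = Var(), Var(), Var()
X2 = Struct("f", [Y, Struct("h", [Struct("c")])])
ok_partial = read(("f", [("g", [A2]), ("h", [B2])]), X2)
```

In the first case the single dereference of X decides the mode for the entire term. In the second case only the subterm g(A2) is created, and the sibling h(B2) is matched in read mode.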

2.4.3 How to Compile Backtracking: Clause Selection Algorithms 

This section surveys the clause-selection algorithms that have been developed since the WAM. 
The WAM supports first-argument selection. It has instructions that can choose clauses based 
on the main functor of the first argument. If all of a predicate's clauses contain different main 
functors, then a hash table can be constructed and calling the predicate will avoid a choice 
point creation when the first argument is not a variable. In the general case, predicates can 
be compiled to create at most one choice point between entry and the execution of the first 
clause [24, 148]. The original WAM report describes a two-level indexing scheme which 
creates up to two choice points [163]. 
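
First-argument selection with a hash table can be sketched as follows. This is a simplified Python model, assuming that no clause has a variable as its first argument. The functor keys, clause names, and function names are hypothetical.

```python
def build_index(clauses):
    """Map each first-argument main functor to the clauses that have it.

    clauses: list of (clause_id, first_arg_functor) pairs.
    """
    index = {}
    for clause_id, functor in clauses:
        index.setdefault(functor, []).append(clause_id)
    return index


def select(index, clauses, first_arg_functor):
    """Return (candidate clauses, whether a choice point is needed).

    first_arg_functor is None when the first argument of the call is an
    unbound variable, in which case every clause must be tried.
    """
    if first_arg_functor is None:
        candidates = [clause_id for clause_id, _ in clauses]
    else:
        candidates = index.get(first_arg_functor, [])
    return candidates, len(candidates) > 1


# append/3: the two clauses have different main functors in argument 1.
append_clauses = [("append_nil", "[]/0"), ("append_cons", "./2")]
append_index = build_index(append_clauses)
```

A call with a list in the first argument selects a single clause and needs no choice point. A call with an unbound first argument must try both clauses and therefore creates one.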

Many programs cannot profit from first-argument selection. For example, selection may depend 
on more than one argument. The following example is extracted from an actual program. The 
first two arguments are integer inputs, the third is an output (all numbers are in base two): 

get_relop(2'001, 2'001, 2'000). 
get_relop(2'001, 2'010, 2'011). 
get_relop(2'001, 2'011, 2'000).
... 33 more clauses ... 

The second example is a predicate in which selection depends on arithmetic comparisons 
instead of unification only: 

max(A, B, C) :- A<B, C=B. 
max(A, B, C) :- A>B, C=A. 

In general, selection is possible if the compiler can determine that only a subset of the clauses 
in the definition can possibly succeed, given some particular argument types at the call. An 
appropriate definition of type is given in Section 2.4.5. 
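
The two examples can be handled by selection code of the following kind. The Python below is an illustrative sketch that assumes the input arguments are bound integers at the call. The table holds only the first two get_relop clauses shown above, and the clause names and function names are hypothetical.

```python
def candidates_get_relop(table, a, b):
    """Select on the first two arguments together."""
    return table.get((a, b), [])


def candidates_max(a, b):
    """Select the clause of max/3 by arithmetic comparison."""
    if a < b:
        return ["max_1"]      # max(A, B, C) :- A<B, C=B.
    if a > b:
        return ["max_2"]      # max(A, B, C) :- A>B, C=A.
    return []                 # A equals B: neither clause can succeed


relop_table = {
    (0b001, 0b001): ["relop_1"],
    (0b001, 0b010): ["relop_2"],
}
```

In both cases at most one clause is selected, so no choice point is needed. First-argument selection alone could not achieve this for get_relop, since several clauses share the same first argument.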

An ideal clause-selection algorithm would generate code that has the following properties: 

• It takes advantage of argument types to try only the clauses that can possibly succeed. 

• It avoids all useless choice point creations. 

• Its size is linear in the size of the program.

• It creates choice points incrementally, i.e., choice points contain only that part of the
execution state that needs to be saved. 

• Its performance degradation is gradual if insufficient type information is known. 

There is no published algorithm that satisfies all these conditions. There are published algo- 
rithms that satisfy some of the conditions and do better clause selection than the WAM. Several 
algorithms create a selection tree or graph, i.e., a series of tests that determine which subset 
of clauses to try given particular arguments (e.g., [167]). Generating a naive selection tree 
may result in exponential code size for predicates encountered in real-world programs. The 
following algorithms are noteworthy: 

• Van Roy, Demoen, and Willems [149] present a compilation algorithm that generates 
a naive selection tree and creates choice points incrementally. The algorithm compiles 
clauses with four entry points, depending on whether or not there are alternative clauses, 
and whether or not a previously executed clause has created a choice point. The algorithm 
was not implemented. 

• Carlsson [25] has implemented a restricted version of the above algorithm in SICStus 
Prolog. Meier [92] has done a similar implementation in KCM-SEPIA. Choice point 
creation is split into two parts. The try and try_me_else instructions are modified to 
create a partial choice point that only contains P and TR. A new instruction, neck, is 
added. If a partial choice point exists when neck is executed, then the remaining registers 
are filled in. Two entry points are created for each clause: one when there are alternative 
clauses, and one where there are none. A neck instruction is only included in the first 
case. In SICStus, this algorithm results in a performance improvement of 7% to 15% for 
four large programs, at a cost of a 5% to 10% increase in code size. 

• Hickey and Mudambi [61] present compilation algorithms to generate a tree of tests 
and to minimize work done in backtracking. One of their selection algorithms results 
in a tree that has a quadratic worst-case size. They improve choice point management. 
The try instruction only stores registers needed in clauses after the first clause. The 
retry and trust instructions restore only those registers needed in the clause and remove 
the registers not needed in subsequent clauses. The latter operation lets the garbage 
collector recover more memory. The technique of improved choice point management 
was independently invented earlier by Andrew Turk [146] and later by Van Roy [153]. 
The technique has not yet been quantitatively evaluated. 

• Kliger [71, 72] presents a compilation algorithm that generates a directed acyclic graph 
of tests (a "decision graph"). The algorithm was extended by Korsloot and Tick for 
nondeterminate ("don't know") predicates [74]. The graph has two important properties. 
First, it never does worse than first-argument selection. Second, it has size linear in 
the number of clauses. This follows from the property that each clause corresponds to 
a unique path through the graph. Linear size is essential when compiling predicates 
consisting of a large number of clauses.

• The Aquarius system [153, 154] produces a selection graph for disjunctions containing
three kinds of tests: unifications, type tests, and arithmetic comparisons. It uses heuristics 
to decide which tests to do first and whether to use linear search or hashing for table 
lookup. The nodes in the graph partition the tests occurring in the predicate. Each node 
corresponds to a subset of these tests. Unifications are only used as tests if it can be 
deduced from the predicate's type information that they will be executed in read mode. 
The type enrichment transformation adds type information to a predicate that lacks it. The 
performance of the resulting code is therefore always at least as good as first-argument 
selection. The factoring transformation allows the system to take advantage of tests on
variables inside of terms, by performing the term unification once for all occurrences of 
the term. The problem with Aquarius selection is similar to that of the naive selection 
tree: if too much type information is given, then the selection graph may become too 
large. 

• The Parma system [140] uses techniques similar to Aquarius. It produces efficient 
indexing code for the same three kinds of tests. To improve the clause selection, Parma 
uses transformations analogous to type enrichment and factoring. It uses optimal binary 
search for table lookup. Taylor's dissertation discusses how to choose between linear 
search, binary search, jump tables, and hashing. 

2.4.4 Native Code Compilation 

One way to improve the performance of a WAM-based system is to add instructions. For 
example, instructions can be added to do efficient arithmetic and to index on multiple arguments. 
Common instruction sequences can be collapsed into single instructions. This is quick to 
implement, but it is inherently a short-term solution. As the number of instructions increases, 
the system becomes increasingly unwieldy. 

The main insight in speeding up Prolog execution is to represent the code in terms of simple 
instructions. The first published experiments using this idea were done in 1986 by Komatsu 
et al. [73, 135] at IBM Japan. These experiments gave the first demonstration that specialized
hardware is not essential for high-performance execution of Prolog. Compilation is done in 
three steps. The first step is to compile Prolog into a WAM-like intermediate code. In the 
second step the WAM-like code is translated into a directed graph. The graph is optimized 
using rewrite rules. In the final step, the result is translated into PL.8 intermediate code and 
compiled with an optimizing compiler. For several small programs, this system demonstrated 
a fourfold performance improvement using mode hints given by the programmer. 

Around 1988, Andrew Taylor and I independently set about building full systems (Parma and 
Aquarius) that compile directly to a simple instruction set, using global analysis to provide 
information for optimizations. Both Parma and Aquarius bypass WAM instructions entirely 
during compilation. We were confident that the fine granularity of the instruction set would 
allow us to express all optimizations. Taylor presented results for his Parma system in two 
important papers [138, 139]. The first paper presents and evaluates a practical global analysis 
technique that reduces the need for dereferencing and trailing. The second paper presents 
performance results for Parma on a MIPS processor. The first results for Aquarius were 
presented in [63], which describes the VLSI-BAM processor and its simulated performance. 
A second paper measures the effectiveness of global analysis in Aquarius [152]. Both the 
Parma and Aquarius systems vastly outperform existing implementations. They prove the 
effectiveness of compiling directly to a low-level instruction set using global analysis to help 
optimize the code. 

Kernel Prolog Code 

    append(A,B,C) :- 
        ( cons(A) -> 
            A=[X|L1], 
            C=[X|L3], 
            append(L1,B,L3) 
        ;   A=[], 
            B=C 
        ). 

BAM Code 

        procedure(append/3). 
        test(ne,tlst,r(0),L1). 
    L2: 
        pragma(align(r(0),2)). 
        pragma(tag(r(0),tlst)). 
        move([r(0)],r(3)). 
        pragma(tag(r(0),tlst)). 
        move([r(0)+1],r(0)). 
        pragma(tag(r(2),tvar)). 
        move(tlst^r(h),[r(2)]). 
        pragma(push(term(2))). 
        pragma(push(cons)). 
        push(r(3),r(h),1). 
        move(tvar^r(h),r(2)). 
        adda(r(h),1,r(h)). 
        test(eq,tlst,r(0),L2). 
    L1: 
        equal(r(0),tatm^[],fail). 
        pragma(tag(r(2),tvar)). 
        move(r(1),[r(2)]). 
        return. 

SPARC Code 

    L2: add   %l3, 0, %o0 
        add   %l3, 8, %l3 
        ld    [%g1-1], %g4 
        ld    [%g1+3], %g1 
        st    %g4, [%l3-9] 
        add   %l3, -5, %g3 
        and   %g1, 3, %o0 
        cmp   %o0, 1 
        be,a  L2 
        st    %l3, [%g3] 

Registers:  A, L1  = r(0) = %g1 
            C, L3  = r(2) = %g3 
            X      = r(3) = %g4 
            (heap) = r(h) = %l3 
            (temp)        = %o0 

Tags:  tlst = 1 
       tvar = 0 
       r(h) has tlst tag 

Figure 9: The Aquarius SPARC Code for append/3 in Naive Reverse 

An important idea in both systems is uninitialized variables. An uninitialized variable is 
defined to be an unbound variable that is unaliased, i.e., there is exactly one pointer to it. An 
uninitialized variable can be represented more efficiently than a standard WAM variable. Beer 
first proposed the idea of uninitialized variables after he noticed that most unbound variables 
in the WAM are bound soon afterwards [12]. For example, this is true for output arguments 
of predicates. WAM variables are created as self-referencing pointers in memory, and need 
to be dereferenced and trailed before being bound. This is time-consuming. Beer represents 
variables as pointers to memory words that have not been initialized. He introduces several 
new tags for these variables and keeps track of them at run-time. The creation of uninitialized 
variables is simpler and they do not have to be dereferenced or trailed. Binding them reduces 
to a single store operation. In Parma and Aquarius, these variables are derived by analysis at 
compile-time. They use the same tag as other variables. 

Aquarius supports a further specialization: the "uninitialized register" variable. This idea is 
due to Bruce Holmer. This variable is an output that is passed in a register. No memory is 
allocated for uninitialized registers, unlike standard uninitialized variables. This reduces the 
space advantage of unsafe variables. The use of uninitialized registers allows Aquarius to 
run recursive integer functions faster than popular implementations of C [154]. 6 In principle, 
all uninitialized variables can be transformed into uninitialized registers. In practice, to avoid 
losing last call optimization (see Section 2.2.4) only a subset is transformed [153]. The trade-off 
with last call optimization has not yet been studied quantitatively. 

Figure 9 shows the Aquarius intermediate codes (kernel Prolog and BAM code) and the SPARC 
code generated for append/3 in naive reverse. See Figures 3 and 4 for the Prolog source code 
and WAM code. Kernel Prolog is Prolog without syntactic sugar and extended with efficient 
conditionals, arithmetic, and cut. 

The BAM (Berkeley Abstract Machine) is an execution model with a memory organization 
similar to the WAM. The BAM defines a load-store instruction set supplemented with tagged 
addressing modes, pragmas, and five Prolog-specific instructions (dereference, trail, general 
unification, choice point manipulation, and environment manipulation). Pragmas are not 
executable but give information that improves the translation to machine code. 

In the SPARC code, tags are represented as the low two bits of a 32-bit word. This is a common 
representation that has low overhead for integer arithmetic and pointer dereferencing [52]. The 
tag of a pointer is always known at compile-time (it is put in a pragma). When following 
a pointer, the tag is subtracted off at zero cost with the SPARC's register+displacement 
addressing mode. The compiler derives the following types for append/3: 

:- mode((append(A,B,C) :- 

ground(A), rderef(A), 
ground(B), rderef(B), 
uninit(C))). 

An uninitialized argument is represented by uninit. A ground argument contains no 
unbound variables. A recursively dereferenced (rderef) argument is dereferenced and its 
arguments are recursively dereferenced. This type generalizes the DEC-10 mode: 

:- mode append(++, ++, -). 

which states that the first two arguments are ground and the last argument is an unbound 
variable. 



6 I posted this result to the Internet newsgroup comp.lang.prolog in February 1991, with the comment: "Don't 
believe it any more that there is an inherent performance loss when using logic programming". There was a barrage 
of responses, ranging from the incredulous (and incorrect) comment "Obviously, he's comparing apples and oranges, 
since the system must be doing memoization" to the encouraging "That's telling 'em Peter". 

2.4.5 Global Analysis 

Global analysis of logic programs is used to derive information to improve program execution. 
Both type and control information can be derived and used to increase speed and reduce 
code size. The analysis algorithms studied so far are all instances of a general method called 
abstract interpretation [34, 35, 69]. The idea is to execute the program over a simpler domain. 
If a small set of conditions is satisfied, this execution terminates and its results provide a 
correct approximation of information about the original program. Le Charlier et al [80, 81] 
have performed an extensive study of abstract interpretation algorithms and domains and their 
effectiveness in deriving types. Getzinger [50] has recently presented an extensive taxonomy 
of analysis domains and studied their effects on execution time and code size. 
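
As a rough illustration of "executing the program over a simpler domain", the sketch below abstracts every variable to a single bit (known to be ground or not) and abstractly executes a straight-line sequence of unifications. It is a toy under stated assumptions, not the algorithm of any system discussed here: the predicate names (abs_body/3, abs_unify/4, all_ground/2, is_ground/2, add_all/3) are hypothetical, the body is assumed to be a list of X = Term goals, and a real analyzer would also iterate to a fixpoint over recursive predicates.

```prolog
:- use_module(library(lists)).   % member/2

% abs_body(+Unifications, +GroundIn, -GroundOut)
% Abstractly execute a list of X = Term goals. GroundIn and GroundOut are
% lists of the variables known to be ground before and after the goals.
abs_body([], G, G).
abs_body([X = T | Goals], G0, G) :-
    abs_unify(X, T, G0, G1),
    abs_body(Goals, G1, G).

% abs_unify(+X, +Term, +GroundIn, -GroundOut)
% If one side is known ground, a successful unification grounds the other.
abs_unify(X, Term, G0, G) :-
    term_variables(X, VX),
    term_variables(Term, VT),
    (   all_ground(VX, G0) -> add_all(VT, G0, G)
    ;   all_ground(VT, G0) -> add_all(VX, G0, G)
    ;   G = G0
    ).

% all_ground(+Vars, +Ground): every variable in Vars is known ground.
all_ground(Vs, G) :-
    \+ ( member(V, Vs), \+ is_ground(V, G) ).

% is_ground(+Var, +Ground): Var occurs (by identity) in the ground list.
is_ground(V, G) :-
    member(W, G), W == V, !.

% add_all(+Vars, +GroundIn, -GroundOut): record Vars as ground.
add_all([], G, G).
add_all([V|Vs], G0, G) :-
    (   is_ground(V, G0) -> G1 = G0 ; G1 = [V|G0] ),
    add_all(Vs, G1, G).
```

For the two head unifications of append/3 with a ground first argument, the query abs_body([A = [X|L1], C = [X|L3]], [A], G) records X and L1 as ground and leaves C and L3 unknown.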

Since Mellish's early work in 1981 and 1985 [96, 98], global analysis has been considered 
useful for Prolog implementation. This section summarizes the work that has been done in 
making analysis part of a real system. By type we denote any information known (at 
compile-time or at run-time) about a variable's value at run-time. A mode is a restricted type that 
indicates whether the variable is used as an input (nonvariable) or an output (unbound variable). 
Useful types include argument values, compound structures, dependencies between variables, 
and operational information such as the length of dereference chains (see also Sections 2.4.6 
and 5.1). 

In 1982, Lee Naish performed an experiment with automatically generated control information 
for MU-Prolog [106]. Control information is all information related to the execution order 
of a program's procedures. The MU-Prolog interpreter supports wait declarations. A "wait" 
declaration defines a set of arguments of a predicate that may not be constructed by a call (i.e., 
unified in write mode). When a call attempts to construct a term in any of these arguments, then 
the call delays until the argument is sufficiently instantiated so that no construction is done (i.e., 
the argument is unified in read mode). This provides a form of coroutining. The automatic 
generation of "wait" declarations is based on a simple heuristic: to delay rather than guess 
one of an infinite number of bindings. 7 A "wait" declaration is inserted for each recursive call 
that does not progress in its traversal of a data structure. This algorithm was implemented and 
tested on some small examples. It significantly reduces the programmer's burden in managing 
control, but it does not always help: if the clause head is as general as the recursive call then 
no "wait" declaration is generated, even though one might be necessary. 

A later system, NU-Prolog, supports when declarations. These are both more expressive 
and easier to compile into efficient code (see Section 3.1.3). A "when" declaration is a 
pattern containing a term with optional variables and a nested conjunction and/or disjunction 
of nonvariable and ground tests on these variables. Variables may not occur more than once in 
the term. A "when" declaration is true if unification between the term and the call succeeds and 
does not bind any variables in the call. A call will delay until its "when" declarations are true. 
This is called one-way unification or matching. NU-Prolog contains an analyzer that derives 
"when" declarations. 

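The delaying behavior described above can be approximated at run-time in a modern system. The sketch below uses when/2 from SWI-Prolog's library(when); this is an assumption made for illustration and is not the NU-Prolog declaration syntax. The predicate names lazy_append/3 and lazy_append_/3 are hypothetical.

```prolog
:- use_module(library(when)).    % when/2 (SWI-Prolog)

% lazy_append(?Xs, ?Ys, ?Zs)
% Like append/3, but each call suspends until the first or the third
% argument is bound, so it never guesses one of infinitely many lists.
lazy_append(Xs, Ys, Zs) :-
    when(( nonvar(Xs) ; nonvar(Zs) ), lazy_append_(Xs, Ys, Zs)).

lazy_append_([], Ys, Ys).
lazy_append_([X|Xs], Ys, [X|Zs]) :-
    lazy_append(Xs, Ys, Zs).
```

A call such as lazy_append(Xs, [c], Zs) with both Xs and Zs unbound leaves Zs unbound; binding Xs = [a,b] afterwards wakes the suspended goal, which then computes Zs = [a,b,c].
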


7 This heuristic is closely related to the "Andorra principle" [33, 55]. The main difference is that the heuristic is 
applied at analysis time whereas the Andorra principle is applied at run-time. 

In 1988, Richard Warren, Manuel Hermenegildo, and Saumya Debray did the first 
measurements of the practicality of global analysis in logic programming [60, 164]. They measured 
two systems, MA3, the MCC And-parallel Analyzer and Annotator, and Ms, an experimental 
analysis scheme developed for SB-Prolog. The paper concludes that both dataflow analyzers 
are effective in deriving types and do not unduly increase compilation time. 

In 1989, André Mariën et al [87] performed an interesting experiment in which several small 
Prolog predicates (recursive list operations) were hand-compiled with four levels of 
optimization based on information derivable from a global analysis. The levels progressively include 
unbound variable and ground modes, recursively defined types, lengths of dereference chains, 
and liveness information for compile-time garbage collection. Execution time measurements 
show that each analysis level significantly improves speed over the previous level. This 
experiment shows that a simple analysis can achieve good results on small programs. 

Despite this experimental evidence, there was until 1993 no generally available sequential 
Prolog system that did global analysis, and since 1988 only a few research systems doing 
analysis. Why is this? I think the most important reason is that other areas of system 
development were considered more important. Commercial systems worked on improving 
their development environments: source-level debugging, a proper foreign language interface, 
and useful libraries. Research systems worked in other areas such as language design and 
parallelism. A second reason may be that the structure of the WAM (high-level compact 
instructions) does not lend itself well to the optimizations that analysis supports. A whole 
new instruction set would be needed, and the development effort involved may have seemed 
prohibitive given the existing investment in the WAM. A third reason is that analysis was 
erroneously considered impractical. 

Currently, there are at least seven systems that do global analysis of logic programs: 

• Ms, the analyzer for SB-Prolog, written by Saumya Debray. Ms derives ground and 
nonvariable types. 

• MA3, the analyzer for &-Prolog, written by Manuel Hermenegildo and Richard Warren. 
MA3 derives variable sharing (aliasing) and groundness information. This information 
is used to eliminate run-time checks in the And-parallel execution of Prolog. This was 
the first practical application of abstract interpretation to logic programs. The &-Prolog 
system both derives information and uses it for optimization. PLAI, the successor to 
MA3, subsumes it and has been extended to analyze programs in constraint languages [49] 
and languages with delaying [90]. 

• The FCP(:,?) compiler (Flat Concurrent Prolog with Ask and Tell guards and read-only 
variables), written by Shmuel Kliger, has a global analysis phase [72]. 

• The Parma system, written by Andrew Taylor, is an implementation of Prolog with global 
analysis targeted to the MIPS processor [140]. 

• The Aquarius system is an implementation of Prolog with global analysis targeted to the 
VLSI-BAM processor and various general-purpose processors [58, 153]. An extensive 
study of improved analyzers and their integration in the Aquarius system is given in [50]. 

• The MU-Prolog analyzer generates "wait" declarations for coroutining [106]. Its 
improved NU-Prolog version generates "when" declarations. 

• The IBM Prolog analyzer. It determines whether choice points have been created or 
destroyed during execution of a predicate, and whether there are pointers into the local 
stack. This improves the handling of unbound variables and the management of 
environments. The IBM Prolog analyzer has been available since 1989 (see Section 3.1.6). 
There is no published information on the analyzer. 

Of these systems, five were developed for sequential Prolog (the MU-Prolog and IBM Prolog 
analyzers, Ms, Parma, and Aquarius) and two for parallel systems (MA3 and the FCP(:,?) 
analyzer). Three (MA3, Parma, and Aquarius) have been integrated into Prolog systems and 
their effects on performance evaluated and published [60, 140, 153]. The analysis domains 
of Aquarius and Parma are shown in Figure 10. For both analyzers the analysis time is linear 
in program size and performance is adequate. Four analyzers (MA3, FCP(:,?), Aquarius, and 
IBM Prolog) are robust enough for day-to-day programming. 

                Speedup factor        Code size reduction 
    System      Small     Medium      Small     Medium 
    Aquarius    1.5       1.2         2.2       1.8 
    Parma       3.0       2.1         2.9       2.0 

Table 3: The Effectiveness of Analysis for Small and Medium-Size Programs 

The effect of the Aquarius and Parma analyzers on speed and code size is shown in Table 3. 
The "Small" column refers to a standard set of small benchmarks (between 10 and 100 lines). 
The "Medium" column refers to a standard set of medium-size benchmarks (between 100 and 
1000 lines). These benchmarks are well-known in the Prolog programming community [156]. 
They do tasks for which Prolog is well-suited and are written in a good programming style. 
The numbers are taken from [140, 152, 153]. The numbers can be significantly improved by 
tuning the programs to take advantage of the analyzers. 

For the medium-size benchmarks, the Aquarius analyzer finds uninitialized, ground, 
nonvariable, and recursively dereferenced types in 23%, 21%, 10%, and 17% of predicate arguments, 
respectively, and 56% of predicate arguments have types. 8 One third of the uninitialized types 
are uninitialized register types, so about one twelfth of all predicate arguments are outputs 
passed in registers. On the VLSI-BAM this results in a reduction of dereferencing from 11% 
to 9% of execution time and a reduction of trailing from 2.3% to 1.3% of execution time. 

The Parma analyzer's domain has been split into parts and their effects on performance measured 
separately. For the medium-size benchmarks, performance is improved through dereference 
chain analysis by 14%, trailing analysis by 8%, structure/list analysis by 22%, and uninitialized 
variables by 12%. The combined benefit of two analysis features is usually not their product, 
since features may compete with or enhance each other. For example, uninitialized variables 
do not need to be trailed, and this fact will often also be determined by the trailing analysis. 

8 Arguments can have more than one type. 

Aquarius domain (lattice elements, from top to bottom) 

    top 
    ground    nonvar    uninit 
    rderef 
    uninit_reg 
    bottom 

• rderef = recursively dereferenced. It is dereferenced and its arguments are recursively 
dereferenced. 

• uninit = uninitialized (unaliased and unbound). Binding needs no dereferencing nor trailing. 

• uninit_reg = uninitialized register. An output that is passed in a register. 

Parma domain (lattice elements, from top to bottom) 

    top(may_alias) 
    bound(may_alias)    free(may_alias, must_alias, is_aliased, trailflag, derefflag) 
    list(car, cdr)      term(functor(a1type, ..., antype)) 
    ground 
    const 
    number    atom 
    integer   float 
    bottom 

• list(car,cdr) = {cdr, [car|cdr], ..., [car,...,car|cdr]}, i.e., it is able to represent 
difference lists. 

• Objects with the same must_alias values are certainly aliased. 

• Objects with different may_alias values are certainly not aliased. 

• Terms are nested to four levels. 

Figure 10: The Analysis Domains of the Aquarius and Parma Systems 

Two conclusions can be drawn by studying the effects of analysis in the Parma and Aquarius 
systems. Analysis results in both code size reduction and speed increase. A first conclusion is 
that the effects of analysis on code size and speed are fundamentally different. Derived types 
allow both tests and the code that handles other types to be removed. Tests are usually fast. 
The code to handle all possible outcomes of the tests can be very large. For Aquarius the code 
size reduction is greater than the performance improvement. This is partly due to the lack of 
structure and list types in the Aquarius domain, which means that run-time type tests are still 
needed. For Parma the code size reduction is about the same as the performance improvement. 

A second conclusion can be drawn regarding the types that are most useful for the compiler. 
Deriving types that have a logical meaning is not sufficient. Performance increases significantly 
when the analysis is able to derive types that have only an operational meaning, such as 
dereference (reference chains), trailing, and aliasing-related types (uninitialized variables). 

2.4.6 Using Types when Compiling Unification 

It is just as hard to use analysis in the compiler as it is to do analysis. Very little has been 
published on how to use analysis. This section shows how unification is compiled in the 
Aquarius system to take maximum advantage of the types known at compile-time. The code 
generated by the two-stream algorithm of Section 2.4.2 handles the general case when no types 
are known. If types are known then compiling unification becomes a large case analysis. 9 
Even after common cases are factored out, the number of cases remains large. Figure 11 gives 
a simplified view of the top two levels of the case analysis done in Aquarius. 

Table 4 gives details of the case analysis done in Aquarius for the compilation of the unification 
X=Y with type T. The compiler attempts to use all possible type information to simplify the 
code. A general unify instruction is only generated once (in oldvar_oldvar), namely when 
unifying two initialized variables for which nothing is known. For simplicity, the table omits 
the generation of dereference and trail instructions, the handling of uninitialized memory and 
uninitialized register variables, the updating of type information when variables are bound, the 
generation of pragmas, and various less important optimizations. See Section 2.4.4 for more 
information. 

9 Compiling a goal invocation (a call) is also a large case analysis [153]. 

Figure 11: Case Analysis in Compiling Unification 

Definition of routines 

Name                 Condition                 Actions 
-------------------  ------------------------  ---------------------------------------- 
unify(X,Y)           var(X),var(Y)             var_var(X,Y) 
                     var(X),nonvar(Y)          var_nonvar(X,Y) 
                     nonvar(X),var(Y)          var_nonvar(Y,X) 
                     nonvar(X),nonvar(Y)       nonvar_nonvar(X,Y) 
nonvar_nonvar(X,Y)                             For all arguments Xi, Yi: unify(Xi,Yi) 
var_nonvar(X,Y)      T=>new(X)                 new_old(X,Y) 
                     T=>ground(X)              old_old(X,Y) 
                     otherwise                 old_old(X,Y) (with depth limiting) 
var_var(X,Y)         T=>(old(X),old(Y))        oldvar_oldvar(X,Y) 
                     T=>(old(X),new(Y))        Generate store instruction 
                     T=>(new(X),old(Y))        Generate store instruction 
                     T=>(new(X),new(Y))        new_new(X,Y) 
new_new(X,Y)                                   Generate store and move instructions 
new_old(X,Y)         compound(Y)               write_sequence(X,Y) 
                     atomic(Y)                 Generate store instruction 
                     var(Y)                    var_var(X,Y) 
old_old(X,Y)         (illegible)               Test Y's type, then old_old_read(X,Y) 
                     (illegible)               old_old_read(X,Y) 
                     (illegible)               old_old_write(X,Y) 
                     compound(Y)               Generate switch, old_old_read(X,Y), 
                                               old_old_write(X,Y) 
                     atomic(Y)                 Generate unify_atomic instruction 
                     var(Y)                    var_var(X,Y) 
oldvar_oldvar(X,Y)   A=atomic_value(T,X)       unify(Y,A) 
                     A=atomic_value(T,Y)       unify(X,A) 
                     T=>(atomic(X),atomic(Y))  Generate comparison instruction 
                     T=>(var(X),nonvar(Y))     Generate store instruction 
                     T=>(nonvar(X),var(Y))     Generate store instruction 
                     otherwise                 Generate general unify instruction 
old_old_write(X,Y)   compound(Y)               write_sequence(X,Y) 
                     atomic(Y)                 Generate store instruction 
old_old_read(X,Y)    compound(Y)               Test Y's functor, then for all arguments 
                                               Xi, Yi: old_old(Xi,Yi) 
                     atomic(Y)                 Generate comparison instruction 
write_sequence(X,Y)                            Generate instructions to create 
                                               compound term Y in X 

Table 4: The Case Analysis in the Aquarius Compilation of the Unification X=Y with Type T 

The variable T denotes the type information known at the start of the unification. The 
implication (T=>ground(X)) is true if T implies that X is bound to a ground term at run-time. The 
conditions var(X) and (T=>var(X)) are very different: the first tests whether X is a variable at 
compile-time, and the second tests whether X is a variable at run-time. The condition new(X) 
succeeds if X does not yet have a value, i.e., for the first occurrence of X in the clause or if X is 
uninitialized. The condition old(X) is the negation of new(X). The function atomic_value(T,X) 
succeeds if T implies that X is an atomic term whose value is known at compile-time. The 
function returns this atomic term. For example, if T is (X==a), then the function returns the 
atom 'a'. 
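
The first rows of Table 4 (the unify, var_nonvar, and var_var routines) can be read as a small dispatcher. The sketch below is one possible rendering under assumptions of my own: clause variables are represented by Prolog variables, the type information T is a list of facts such as new(V) or ground(V), and the predicate names select_routine/4, var_nonvar/4, var_var/4, var_var_action/5, and implies/2 are hypothetical. It only names the routine or instruction to be generated; it does not generate code.

```prolog
:- use_module(library(lists)).   % member/2

% select_routine(+X, +Y, +T, -Routine)
% Top level of the case analysis: dispatch on whether X and Y are
% variables at compile-time.
select_routine(X, Y, T, R) :- var(X), var(Y), !, var_var(X, Y, T, R).
select_routine(X, Y, T, R) :- var(X), !, var_nonvar(X, Y, T, R).
select_routine(X, Y, T, R) :- var(Y), !, var_nonvar(Y, X, T, R).
select_routine(X, Y, _, nonvar_nonvar(X, Y)).

% var_nonvar(+Var, +Term, +T, -Routine)
var_nonvar(X, Y, T, new_old(X, Y)) :- implies(T, new(X)), !.
var_nonvar(X, Y, T, old_old(X, Y)) :- implies(T, ground(X)), !.
var_nonvar(X, Y, _, old_old_depth_limited(X, Y)).

% var_var(+X, +Y, +T, -Routine): old(V) is the negation of new(V).
var_var(X, Y, T, R) :-
    ( implies(T, new(X)) -> SX = new ; SX = old ),
    ( implies(T, new(Y)) -> SY = new ; SY = old ),
    var_var_action(SX, SY, X, Y, R).

var_var_action(old, old, X, Y, oldvar_oldvar(X, Y)).
var_var_action(old, new, _, _, generate(store)).
var_var_action(new, old, _, _, generate(store)).
var_var_action(new, new, X, Y, new_new(X, Y)).

% implies(+T, +Fact): Fact is listed in T (variables compared by identity).
implies(T, Fact) :- member(F, T), F == Fact, !.
```

For example, select_routine(X, f(a), [new(X)], R) selects new_old(X, f(a)), which in Table 4 leads to write_sequence because f(a) is compound.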

The general unify instruction is expanded into code that handles inline the cases where one 
or both of the arguments are variables. Measurements of the dynamic behavior of unification 
on four real programs show that one or both of the arguments are variables about 85% of the 
time [63]. A subroutine call is made only if both arguments are nonvariables. 

2.5 Beyond the WAM: Radically Different Execution Models 

Some recent developments in Prolog implementation are based on novel models of execution 
very different from the WAM. The Vienna Abstract Machine (VAM) is based on partial 
evaluation of each call. The BinProlog system is based on the explicit passing of success 
continuations. 

2.5.1 The Vienna Abstract Machine (VAM) 

The VAM is an execution model developed by Andreas Krall at the Technische Universität 
Wien (Vienna, Austria) [77]. The VAM is considerably faster than the WAM. The insight of the 
VAM is that the WAM's separation of argument setup from argument unification is wasteful. 
In the WAM, all of a predicate's arguments are built before the predicate is called. The VAM 
does argument setup and argument unification at the same time. During the call the operations 
of argument setup and unification are combined into a single operation that does the minimal 
work necessary. This results in considerable savings in many cases. For example, consider 
the call p(X,[a,b,c],Y) to the definition p(A,_,B). The second argument [a,b,c] is not created 
because it is a void variable in the head of the definition. In the WAM, the second argument 
would be created and then ignored in the definition. 

There exist two versions of the VAM: the VAM1P and the VAM2P. The difference is in how 
the argument traversal is done. In the VAM2P there are two pointers. One points to the caller's 
arguments and one points to the definition's arguments. The operation to be performed for each 
argument is obtained by a two-dimensional array lookup depending on the types of the caller 
argument and the definition argument. This lookup operation can be made extremely fast by 
a technique similar to direct threaded coding, where the address of the abstract instruction is 
obtained by adding two offsets. In the VAM1P there is a single pointer that points to compiled 
code representing the caller-definition pair. The code size for the VAM1P is much greater than 
for the VAM2P, since the called predicate must be compiled separately for each call. Currently, 
the VAM2P is a practical implementation, whereas the VAM1P is not because of code size 
explosion. 

2.5.2 BinProlog 

BinProlog 10 is a high performance C-based emulator developed by Paul Tarau at the 
Université de Moncton (Canada) [36, 136, 137]. BinProlog has two key ideas: transforming 
clauses to binary clauses and passing success continuations. The resulting instruction set is 
essentially a simplified subset of the WAM. Implementing Prolog by means of continuations 
is an old technique. It was used to implement Prolog on Lisp machines and in Pop-11 (see, 
for example, [23, 97]). The technique has recently received a boost from Tarau's highly efficient 
implementation. Functional languages have more often been implemented by means of 
continuations. A good example is the Standard ML of New Jersey system, which uses an intermediate 
representation in which all continuations are explicit ("Continuation-Passing Style") [9]. 

The idea of BinProlog is to transform each Prolog clause into a binary clause, i.e., a clause 
containing only one body goal. Predicates that are expanded inline (such as simple built-ins) are 
not considered as goals. The body goal is given an extra argument, which represents its success 
continuation, i.e., the sequence of goals to be executed if the body completes successfully. 
This representation has two advantages. First, no environments are needed. Second, the 
continuations are represented at the source level. For example, the clauses: 

p(X, X). 

p(A, B) :- q(A, C), r(C, D), s(D, B). 

are transformed into: 

p(X, X, Cont) :- call(Cont). 

p(A, B, Cont) :- q(A, C, r(C, D, s(D, B, Cont))). 

Each predicate is given an additional argument and each clause is converted into a binary 
clause. 
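
The transformation itself fits in a few lines of Prolog. The following is a minimal sketch, not BinProlog's actual implementation: the predicate names binarize/2, add_cont/3, and body_to_cont/3 are hypothetical, clause bodies are assumed to be plain conjunctions of callable goals, and the special treatment of inline built-ins mentioned above is omitted.

```prolog
:- use_module(library(lists)).   % append/3

% binarize(+Clause, -BinaryClause)
% A rule keeps its head (plus a continuation argument) and gets a single
% body goal; a fact simply calls its continuation.
binarize((Head :- Body), (BinHead :- BinBody)) :- !,
    add_cont(Head, Cont, BinHead),
    body_to_cont(Body, Cont, BinBody).
binarize(Fact, (BinHead :- call(Cont))) :-
    add_cont(Fact, Cont, BinHead).

% add_cont(+Goal, +Cont, -Goal1): add Cont as an extra last argument.
add_cont(Goal, Cont, Goal1) :-
    Goal =.. List,
    append(List, [Cont], List1),
    Goal1 =.. List1.

% body_to_cont(+Body, +Cont, -Goal)
% Fold the conjunction from the right, so that each goal carries the
% remaining goals as its continuation.
body_to_cont((G, Gs), Cont, Goal) :- !,
    body_to_cont(Gs, Cont, Rest),
    add_cont(G, Rest, Goal).
body_to_cont(G, Cont, Goal) :-
    add_cont(G, Cont, Goal).
```

Applied to the second clause above, binarize/2 yields p(A, B, Cont) :- q(A, C, r(C, D, s(D, B, Cont))). A transformed program is started by passing the continuation true, as in p(a, Y, true).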

With a well-chosen data representation, the binary clause transformation can be the basis of 
a system that uses very little memory yet compiles and executes very quickly. The technique 
as currently implemented has two problems. First, the continuations are put on the heap 
("long-lived" memory), hence they do not disappear on forward execution as environments 
would in the WAM. That is, there is no last call optimization (see Section 2.2.4) if the original 
clause body contains more than one goal. Second, if the first goal fails, then the creation of 
the continuation is an overhead that is avoided in the WAM. Both of these problems are less 
severe than they appear at first glance. The first problem goes away with a suitable garbage 
collector. A copying collector has execution time proportional to the amount of active memory. 
A generational mark-and-sweep collector can perform even better in practice [168]. The second 
problem almost never occurs in real programs. 

10 There is no relationship with BIM Prolog. 

An important potential use of the binary clause transformation is as a tool for source 
transformation in Prolog compilers. By making the continuations of the WAM explicit as data terms, 
a series of optimizing transformations becomes possible at the source level [110]. After doing 
the optimizations, a reverse transformation to standard clauses can be done. 

3 The Systems View 

The previous sections have summarized developments from the technical viewpoint, focusing 
on particular developments without giving further information about the systems that pioneered 
them. 

This section concentrates on the systems themselves. It tells the stories of some of the 
more popular and influential systems, of the people and institutions behind them, and of the 
particular problems they encountered. The section is divided into two parts. Section 3.1 talks 
about software systems and Section 3.2 talks about hardware systems. 

3.1 Software Sagas 

Since the development of the WAM in 1983 there have been many software implementations 
of Prolog. At the end of 1993, more than fifty systems were listed in the Prolog Resource 
Guide [70]. The systems discussed here are MProlog, IF/Prolog, SNI-Prolog, MU-Prolog, 
NU-Prolog, Quintus, BIM, IBM Prolog, SEPIA, ECLiPSe, SB-Prolog, XSB, SICStus, and 
Aquarius. 

All of these systems are substantially compatible with the Edinburgh standard. They have 
been released to users and used to build applications. Many have served as foundations for 
implementation experiments. In particular, MU-Prolog, NU-Prolog, SB-Prolog, XSB, SICStus, 
and Aquarius are delivered with full source code. Quintus, MProlog, IF/Prolog, and SICStus 
are probably the implementations that have been ported to the largest number of platforms. 
The most popular systems on workstations today are SICStus and Quintus. C-Prolog was also 
very popular at one point. 

For each system are listed its most important contributions to implementation technology. These 
lists are not exhaustive. Most of the important "firsts" have since been incorporated into many 
other systems. In some cases, a contribution was developed jointly or spread too fast to identify a 
particular system as the pioneer. For example, almost all commercial systems support modules. 
Likewise, almost all commercial systems have a full-featured foreign language interface, and 
many of them (including Quintus, BIM, IF/Prolog, SNI-Prolog, SICStus, and ECLiPSe) allow 
arbitrarily nested calls between Prolog and C. 

IF/Prolog, SNI-Prolog, IBM Prolog, SEPIA, ECLiPSe, and SICStus support rational tree 
unification. Rational trees account for term equations which express cycles. For example, the 
term equation X=f(X) has a solution over rational trees, but not over finite trees [66, 68]. 
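
The difference can be observed directly in a system with rational tree unification. The sketch below assumes SWI-Prolog, where cyclic_term/1 is a built-in; the predicate names rational_solution/0 and finite_solution/0 are hypothetical. In a system without rational tree support the first predicate may not behave as shown.

```prolog
% Succeeds where unification works over rational trees: X is bound to the
% cyclic term f(f(f(...))).
rational_solution :-
    X = f(X),
    cyclic_term(X).

% Fails: unification with the occurs check works over finite trees only,
% and no finite term satisfies X = f(X).
finite_solution :-
    unify_with_occurs_check(X, f(X)).
```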

All of the compiled systems except MProlog and Aquarius are based on the WAM instruction 
set, but modified and extended to increase performance. MProlog, BIM, IBM Prolog, SEPIA, 
ECLiPSe, and Aquarius support mode declarations and multiple-argument indexing. The other 
systems do not support mode declarations. Quintus, NU-Prolog, and XSB provide some support 
for multiple-argument indexing, and IF/Prolog, SNI-Prolog, and SICStus do not implement it. 
IBM Prolog, SEPIA, ECLiPSe, and Aquarius index on some other conditions than unification, 
for example on arithmetic comparisons and type tests. Quintus, BIM, SEPIA, ECLiPSe, XSB, 
SB-Prolog, but not SICStus, compile conditionals (if-then-else) deterministically in the special 
case where the condition is an arithmetic comparison or a type test. 

The most interesting problems of system building are related to input size. These scalability 
problems tend to occur only when one exercises a system on large inputs. They are the main 
obstacles on the long path between research prototype and production quality systems. For
each system, some of the more interesting such problems are listed.

3.1.1 MProlog

The first commercial Prolog system was MProlog. 11 MProlog was developed in Hungary 
starting in 1978 at NIMIGUSZI (Computer Center of the Ministry of Heavy Industries) [13, 47]. 
The main developer is Peter Szeredi, aided by Zsuzsa Farkas and Peter Koves. MProlog was 
completed at SZKI (Computer Research and Innovation Center), a computer company set up 
a few years before. The implementation is based on Warren's pre-WAM three-stack model of 
DEC-10 Prolog. The first public demonstration was in 1980 and the first sale in September 
1982. 

MProlog is a full-featured structure-sharing system with all Edinburgh built-ins, debugging, 
foreign language interface, and sophisticated I/O. It shows that structure-sharing is as efficient as 
structure-copying [75]. Its implementation was among the most advanced of its day. Early on, 
it had a native code compiler, memory recovery on forward execution (including tail recursion 
optimization) and support for mode declarations (including multiple-argument indexing). It 
had garbage collection for the symbol table and code area. It did not and does not do garbage 
collection for the stacks. MProlog is currently a product of IQSOFT, a company formed in 
1990 from the Theoretical Lab of SZKI. 



11 M for Modular or Magyar.

3.1.2 IF/Prolog and SNI-Prolog

IF/Prolog was developed at InterFace Computer GmbH, which was founded in 1982 in Munich, 
Germany. Nothing has been published about the implementation of IF/Prolog. The following 
information is due to Christian Pichler. IF/Prolog was commercialized in 1983. The first release 
was an interpreter. A WAM-based compiler was released in 1985. The origin of the compiler 
is an early WAM compiler developed by Christian Pichler [116]. The main developers of 
IF/Prolog were Preben Folkjær, Christian Reisenauer, and Christian Pichler. Siemens-Nixdorf
Informationssysteme AG bought the IF/Prolog sources in 1986. They ported and extended the 
system, which then became SNI-Prolog. 

In 1990, SNI-Prolog was completely redesigned from scratch. Pichler went to Siemens-Nixdorf 
to help in the redesign. The main developers of SNI-Prolog are Reinhard Enders and Christian 
Pichler. The current system conforms to the ISO Prolog standard [122], supports constraints, 
has been ported to more platforms and has improved system behavior (more flexible interfaces 
and less memory usage). The design of the new system benefited from the fact that Siemens is 
one of the shareholders of ECRC. Siemens-Nixdorf bought the rights to IF/Prolog in 1993 after 
InterFace disappeared. They plan to integrate the best features of IF/Prolog and SNI-Prolog 
into a single system. 

Both systems support rational tree unification. In addition, SNI-Prolog has delaying, indefinite 
precision rational arithmetic, and constraint solvers for boolean constraints, linear inequalities, 
and finite domains. It has metaterms, which allow constraint solvers to be written in the 
language itself (see Section 3.1.7). 

Both SNI-Prolog and IF/Prolog have extensive C interoperability. With regard to interoper- 
ability they can best be compared with Quintus (see Section 3.1.4). They allow redefinition of 
the C and Prolog top levels and arbitrary calls between Prolog and C to any level of nesting 
with efficient passing of arbitrary data (including compound terms). They have configurable 
memory management and garbage collection of all Prolog memory areas. They are designed 
to interact correctly with the Unix memory system and to support signal handlers. 

3.1.3 MU-Prolog and NU-Prolog

MU-Prolog and NU-Prolog were developed at Melbourne University by Lee Naish and his 
group [106]. Both systems do global analysis to generate delaying declarations (see Sec- 
tion 2.4.5). Neither system does garbage collection. 

MU-Prolog is a structure-sharing interpreter. The original version (1.0) was written by John 
Lloyd in Pascal. Version 2.0 was written by Naish and completed in 1982. Version 2.0 supports 
delaying, has a basic module system and transparent database access. Its performance is slightly
lower than that of C-Prolog.

NU-Prolog is a WAM-based emulator written in C primarily by Jeff Schultz and completed 
in 1985. It is interesting for its pioneering implementation of logical negation, quantifiers, 
if-then-else, and inequality, through extensions to the WAM [105]. The delay declarations
("when" declarations) are compiled into decision trees with multiple entry points. This avoids
repeating already performed tests on resumption. It results in indexing on multiple arguments 
in practice. NU-Prolog was the basis for many implementation experiments, e.g., related to 
parallelism [107, 114], databases [118], and programming environments [108]. 

3.1.4 Quintus Prolog 

Quintus Prolog is probably the best-known commercial Prolog system. Its syntax and semantics 
have become a de facto standard, for several reasons. It is close to the Edinburgh syntax and is 
highly compatible with C-Prolog. It was the first widely known commercial system. Several 
other influential systems (e.g., SICStus Prolog) were designed to be compatible with it. The 
pending ISO standard for Prolog [122] will most likely be close in syntax and semantics to the 
current behavior of Quintus. 

Quintus Computer Systems was founded in 1984 in Palo Alto, California. It is currently 
called Quintus Corporation, and is a wholly-owned subsidiary of Intergraph Corporation. The 
founders of Quintus are David H. D. Warren, Lawrence Byrd, William Kornfeld, and Fernando 
Pereira. They were joined by David Bowen shortly thereafter, and Richard O'Keefe in 1985. 
Tim Lindholm was responsible for many improvements including discontiguous stacks and 
the semantics for self-modifying code (see below). Many other people contributed to the 
implementation. Quintus Prolog 1.0 first shipped in 1985. 

Quintus Prolog compiles to an efficient and compact direct threaded-code representation. For 
portability and convenience, the emulator is written in Progol, a macro-language which is 
essentially a macro-assembler for Prolog using Prolog syntax. The mode flag does not exist 
explicitly, but is cleverly encoded in the program counter by giving the unify instructions two 
entry points. 

Quintus Prolog made several notable contributions, including those listed below. 

• It is the Prolog system that generates the most compact code. Common sequences of 
operations are encoded as single opcodes. The code size is several times smaller than that of
native code implementations. For example, the code generated for a given input program
is about one fifth the size of that generated by the BIM compiler. The code size is 
between one fifth and one half that of Aquarius Prolog. The figure of one half holds only 
when the global analysis of Aquarius performs well [154]. For applications with large 
databases, compact code can become significant. The recent rapid increase in physical 
memory size makes reducing code size less of a priority, although there will always be 
applications (e.g., databases and natural language) that require compact code to run well.

• It was the Prolog system that first developed a foreign language interface. Since then, 
it is the Prolog system that has put the most effort into making the system embeddable. 
It is important to be able to seamlessly integrate Prolog code with existing code. This 
implies a set of capabilities to make the system well-behaved and expressive. Quintus
is able to redefine the C and Prolog top levels. It allows arbitrary calls between Prolog
and C, with efficient manipulation of Prolog terms by C and vice versa. It has an open
interface to the operating system that lets one redefine the low-level interfaces to memory
management and I/O. It does efficient memory management, e.g., it was the first system
to run with discontiguous stacks. This "small footprint" version has been available since
release 3.0. It carefully manages the Prolog memory area to avoid conflicts with C. It
provides tools for the user including source-level debugging on compiled code and an
Emacs interface.

12 The name Progol is a contraction of Prolog and Algol.

• It was the first system to provide a clean and justified semantics for self-modifying code 
(assert and retract), namely the logical view [82]. A predicate in the process of being 
executed sees the definition that existed at the time of the call. 

• It is the system that comes with the largest set of libraries of useful utilities. More than 
one hundred libraries are provided. 
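
The logical view of assert and retract mentioned in the list above can be sketched in a few lines. This Python fragment is only an illustration under simplifying assumptions: the class Predicate and its methods are hypothetical names, clauses are plain strings, and the definition is frozen by keeping a reference to an immutable tuple. Nothing here is meant to describe how Quintus implements the idea internally.

```python
class Predicate:
    """A dynamic predicate whose clauses live in an immutable tuple."""
    def __init__(self):
        self.clauses = ()

    def assertz(self, clause):
        # Build a new tuple; tuples already handed out to running calls are untouched.
        self.clauses = self.clauses + (clause,)

    def retract(self, clause):
        self.clauses = tuple(c for c in self.clauses if c != clause)

    def call(self):
        """Return an iterator over the definition as it exists at call time."""
        snapshot = self.clauses      # captured now, when the call is made
        return iter(snapshot)

p = Predicate()
p.assertz("a")
p.assertz("b")

seen = []
for fact in p.call():
    seen.append(fact)
    p.assertz(fact + "'")   # modify the predicate while it is being executed
    p.retract("b")
```

The running call still sees both "a" and "b", even though "b" is retracted during the first iteration and new clauses are added. Only a later call observes the modified definition.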

3.1.5 BIM Prolog (ProLog by BIM)

BIM Prolog was developed by the BIM company in Everberg, Belgium in close collaboration 
with the Catholic University of Louvain (KUL). The name has recently been changed to 
"ProLog by BIM" due to a copyright conflict with the prefix "BIM" in the United States. 

Logic programming research at KUL started in the mid 1970's. Maurice Bruynooghe had 
developed one of the early Prolog systems, Pascal Prolog, which was used at BIM at that 
time. The BIM Prolog project started in October 1983. It was then called P-Prolog (P for 
Professional). Its execution model was originally derived from the PLM model in Warren's 
dissertation, but was quickly changed to the WAM. 

The first version of BIM Prolog, release 0.1, was distributed in October 1984 and used in an 
ESPRIT project. It was a simple WAM-based compiler and emulator. Meanwhile, Quintus 
had released their first system. The BIM team realized that they needed to go further than 
emulation to match the speed of Quintus, so they decided immediately to do a native code 
implementation through macro-expansion of WAM instructions. In contrast to Quintus, which 
intended to cover all major platforms from the start, BIM initially concentrated on Sun and 
decided to do a really good implementation there. By 1985 the team consisted of the three 
main developers who are still there today: Bart Demoen, Andre Marien, and Alain Callebaut. 
Other people have contributed to the implementation. Because BIM Prolog only ran on a few 
machines, it was possible for different implementation ideas to be tried over the years. For 
more information on the internals of BIM Prolog, see [89]. 

BIM Prolog made several notable contributions, including those listed below. 

• It was the first WAM-based system: 

- To do native code compilation. 

- To do heap garbage collection. The Morris constant-space pointer-reversal algo- 
rithm was available in release 1.0 in 1985. 
- To do symbol table garbage collection. This is important if the system is interfaced
to an external database. 

- To support mode declarations and do multiple-argument indexing, instead of in- 
dexing only on the first argument. 

- To provide modules. 

These abilities were provided earlier by DEC-10 Prolog (see Section 2.1) and MProlog 
(see Section 3.1.1). 

• It was the first system to provide a source-level graphical debugger, an external database 
interface, and separate compilation. 

3.1.6 IBM Prolog 

IBM Prolog was developed primarily by Marc Gillet at IBM Paris. Nothing has been pub- 
lished about the implementation. The following information is due to Gillet and the system 
documentation [67]. The first version, a structure-sharing system, was written in 1983-1984 
and commercialized in 1985 as VM/Prolog. A greatly rewritten and extended version was 
commercialized in 1989 as IBM Prolog. 13 It runs on system 370 under the VM and MVS 
operating systems. The system was ported to OS/2 with a 370 emulator. 

The system is WAM-based and supports delaying, rational tree unification, and indefinite 
precision rational arithmetic. The system does global analysis at the level of a single module 
(see Section 2.4.5). It supports mode declarations, but may generate incorrect code if the 
declarations are incorrect. The system generates native 370 code and has a foreign language 
interface. 

3.1.7 SEPIA and ECLiPSe 

ECRC (European Computer-Industry Research Centre) was created in Munich, Germany in 
1984 jointly by three companies: ICL (UK), Bull (France), and Siemens (Germany). ECRC has 
done research in sequential and parallel Prolog implementation, in both software and hardware. 
See Section 3.2.2 for a discussion of the hardware work. The constraint language CHIP was 
built at ECRC (see Section 2.3.1). 

Several Prolog systems were built at ECRC. An early system is ECRC-Prolog (1984-1986), a 
Prolog-to-C compiler for an enhanced MU-Prolog. At the time, ECRC-Prolog had the fastest 
implementation of delaying. The next system, SEPIA (Standard ECRC Prolog Integrating 
Advanced Features), first released in 1988, was a major improvement [93]. Other systems are 
Opium [44], an extensible debugging environment, and MegaLog [15], a WAM-based system 
with extensions to manage databases (e.g., persistence). The most recent system, ECLiPSe 
(ECRC Common Logic Programming System) [45, 95], integrates the facilities of SEPIA, 
MegaLog, CHIP, and Opium. The system supports rational tree unification and indefinite
precision rational arithmetic. It provides libraries that implement constraint solvers for atomic
finite domains and linear inequalities.

13 Curiously, both systems (VM/Prolog and IBM Prolog) are written mostly in assembly code, several hundred thousand lines' worth.

ECLiPSe is a WAM-based emulator with extensive support for delaying [95]. This makes it 
easy to write constraint solvers in the language itself. ECLiPSe supports this with two concepts: 
metaterms and suspensions. A metaterm is a variable with a set of user-defined attributes. The 
set of attributes is similar to a Lisp property list. A suspension is a closure. It is an opaque 
data type at the Prolog level. A goal can be delayed explicitly by making it into a suspension 
and inserting it into a list of delayed goals. The list is stored as an attribute of a variable. 
When the variable is unified, an event handler is invoked. The handler is free to manipulate the 
suspended goals in any way. Through metaterms, the wakeup order of suspended goals can be 
programmed by the user. 
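
The interplay of metaterms, suspensions, and the event handler can be sketched as follows. This is a minimal Python model, not ECLiPSe code: MetaVar, suspend, bind, and the two handlers are hypothetical names, a suspension is modeled as an ordinary Python closure, and the attribute set is a dictionary standing in for the property list.

```python
class MetaVar:
    """An unbound variable carrying user-defined attributes."""
    def __init__(self, name):
        self.name = name
        self.value = None
        self.attributes = {"suspended": []}   # list of delayed goals

def suspend(var, goal):
    """Delay goal (a closure taking the variable's value) until var is bound."""
    var.attributes["suspended"].append(goal)

def fifo_handler(var):
    """Wake the suspended goals in the order in which they were delayed."""
    goals, var.attributes["suspended"] = var.attributes["suspended"], []
    for goal in goals:
        goal(var.value)

def lifo_handler(var):
    """A user-programmed alternative: wake the most recently delayed goal first."""
    goals, var.attributes["suspended"] = var.attributes["suspended"], []
    for goal in reversed(goals):
        goal(var.value)

def bind(var, value, handler=fifo_handler):
    """Bind the variable, then invoke the event handler."""
    var.value = value
    handler(var)

log = []
X = MetaVar("X")
suspend(X, lambda v: log.append(("first", v)))
suspend(X, lambda v: log.append(("second", v)))
bind(X, 42, handler=lifo_handler)
```

Swapping the handler changes the wakeup order without touching the suspended goals themselves, which is the property the text attributes to metaterms.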

The ECLiPSe compiler is incremental and compilation time is probably the lowest of any 
major system. The debugger uses compiled code supplemented with debugging instructions. 
Because of its fast compilation, ECLiPSe has no need of an interpreter. ECLiPSe (and SEPIA 
before it) uses two-word (64-bit) data items, with a 32-bit tag and a 32-bit value field. This 
allows more flexibility in tag assignment and full pointers can be stored directly in the value 
field. It also makes for a more straightforward C interface. 
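
A two-word data item of this kind can be modeled as a 64-bit integer whose upper half is the tag and whose lower half is the value. The sketch below is illustrative only: the tag numbering and the function names are assumptions, not the encoding used by SEPIA or ECLiPSe.

```python
VALUE_BITS = 32
VALUE_MASK = (1 << VALUE_BITS) - 1

# Hypothetical tag numbering, chosen for illustration only.
TAG_REF, TAG_INT, TAG_ATOM, TAG_STRUCT = range(4)

def make_word(tag, value):
    """Combine a 32-bit tag and a 32-bit value into one 64-bit item."""
    return (tag << VALUE_BITS) | (value & VALUE_MASK)

def tag_of(word):
    return word >> VALUE_BITS

def value_of(word):
    return word & VALUE_MASK

# A full 32-bit address fits in the value field without shifting or masking off bits.
pointer = 0xFFFFFFF0
word = make_word(TAG_STRUCT, pointer)
```

Because the value field is a whole machine word, a pointer stored in it can be passed to C unchanged, which is one way to read the remark about the C interface.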

3.1.8 SB-Prolog and XSB

SB-Prolog is a WAM-based emulator developed by a group led by David Scott Warren at 
SUNY (State University of New York) in Stony Brook. The compiler was written by Saumya 
Debray and the system was bootstrapped with C-Prolog. After several years of development, 
SB-Prolog was made available by Debray from Arizona in 1986. Because it was free and 
portable, it became quite popular. Neither it nor XSB does garbage collection. The worst 
problem regarding portability was the use of the BSD Unix syscall system call which supports 
arbitrary system calls through a single interface. 

SB-Prolog was the basis for much exploration related to language and implementation (e.g., [37]):
backtrackable assert, existential variables in asserted clauses, memoizing evaluation, register 
allocation, mode and type inferencing (see Section 2.4.5), module systems, and compilation. 

The most recent system, XSB, is SB-Prolog extended with memoization (tabling) and HiLog 
syntax [119]. The resulting engine is the SLG-WAM (see Section 2.3.3). XSB 1.3 implements
the SLG-WAM for modularly stratified programs, i.e., for programs that do not dynamically 
have recursion through negation. 

3.1.9 SICStus Prolog

SICStus Prolog 14 was developed at SICS (Swedish Institute of Computer Science) near Stock- 
holm, Sweden. SICS is a private foundation founded in late 1985 which conducts research in 
many areas of computer science. It is sponsored in part by the Swedish government and in 
part by private companies. The guiding force and main developer of SICStus is Mats Carlsson.
Many other people have been part of the development team and have made significant
contributions. 

In 1993, SICStus Prolog was probably the most popular high performance Prolog system 
running on workstations. SICStus is cheap, robust, fast, and highly compatible with the 
"Edinburgh standard". It has been ported to many machines. It has flexible coroutining, 
rational tree unification, indefinite precision integer arithmetic, and a boolean constraint solver. 

The first version of SICStus Prolog, release 0.3, was distributed in 1986. SICStus became 
popular with the 0.5 release in 1987. Originally, SICStus was an emulated system written in 
C. MC680X0 and SPARC native code versions were developed in 1988 and 1991. The current 
version, release 2.1, has been available since late 1991. 

SICStus is the first system to do path compression ("variable shunting") of dereference chains 
during garbage collection [120]. The parts of a dereference chain in the same choice point 
segment are removed. This lets the garbage collector recover more memory. This is essential 
for Prologs that have freeze or similar coroutining programming constructs [24], since the 
intermediate variables in a dereference chain may contain large frozen goals that can be 
recovered. 
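
The effect of shunting on reachability can be shown with a deliberately simplified sketch. The Python code below ignores choice point segments entirely (it assumes all cells belong to one segment, so every binding is safe to shortcut), and the names Cell, shunt, and live_names are hypothetical. It only demonstrates why compressing a chain turns the intermediate variables into garbage.

```python
class Cell:
    """A heap cell: bound to another cell through ref, or the end of a chain."""
    def __init__(self, name, ref=None):
        self.name = name
        self.ref = ref

def deref(cell):
    """Follow the dereference chain to its last cell."""
    while cell.ref is not None:
        cell = cell.ref
    return cell

def shunt(roots):
    """Redirect every bound root cell straight to the end of its chain."""
    for cell in roots:
        if cell.ref is not None:
            cell.ref = deref(cell)

def live_names(roots):
    """Names of all cells reachable from the roots (what a collector must keep)."""
    seen, stack = set(), list(roots)
    while stack:
        cell = stack.pop()
        if cell.name not in seen:
            seen.add(cell.name)
            if cell.ref is not None:
                stack.append(cell.ref)
    return seen

# Chain A -> B -> C -> D, where only A is referenced from outside.
d = Cell("D")
c = Cell("C", d)
b = Cell("B", c)
a = Cell("A", b)

before = live_names([a])
shunt([a])
after = live_names([a])
```

Before shunting all four cells are live; afterwards only A and D are, so B and C (and any frozen goals attached to them) can be reclaimed.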

Among the scalability problems encountered during the development of SICStus are those 
listed below. 

• Interface with malloc/free, the Unix memory allocation library. SICS wrote their own 
version of the malloc/free library that better handles the allocation done by their system. 
Increasing the size of system areas is done by calling realloc. 

• Native code limitations. A problem for large programs is that the offsets in machine 
instructions have a limited size. For example, the SPARC's load and store instructions
use a register+displacement addressing mode with a displacement limited to 12 bits. 
Other native code systems (e.g., IBM Prolog) have run into the same problem. 

• The space versus time trade-off. Native code implementations have larger code size than 
emulated implementations. This difference can be quite significant: a factor of five or 
more. For large programs, e.g., natural language parsers with large databases, having 
compact code can mean the difference between a program that runs and one that thrashes. 
SICStus minimizes the size of its generated native code by calling little-used operations 
as subroutines rather than putting them inline. For example, the dereference operation 
is inlined only for a predicate's first argument. 

3.1.10 Aquarius Prolog 

Aquarius Prolog was originally developed in the context of the Aquarius project at U.C. Berkeley
as the compiler for the VLSI-BAM processor [153] (see Section 3.2.3 for the hardware side
of the story). After our relationship with the hardware side of the project ended in the spring of
1991, Ralph Haygood (the main developer of the back-end, run-time system, and built-ins) and
I decided to continue part-time work on the software so that it could be released to the general 
public [58, 70]. We were joined by Tom Getzinger at USC. The system achieved 1.1 MLIPS 
on a SPARCstation 1+ in February 1991. It first successfully compiled itself in February 1992. 
It was completed and released as Aquarius Prolog 1.0 in April 1993. 

Aquarius Prolog made several contributions, including those listed below. 

• It is the first system to compile to native code without a WAM-like intermediate stage. It 
compiles first to BAM code (see Section 2.4.4), and then macro-expands to native code. 

• It is the first well-documented system to do global analysis. See Section 2.4.5 for more 
information on the analyzer and Sections 2.4.4 and 2.4.6 for information on how the 
analyzer is used to improve code generation. Type and mode declarations are supported. 
They are used to supplement the information generated by analysis. The system may 
generate incorrect code if the declarations are incorrect. 

• It is the first system in which most built-ins are written in Prolog with little or no 
performance penalty. A technique called entry specialization replaces built-ins by more 
specialized entry points depending on argument types known at compile-time. 

• It is the first system to generate code which rivals the performance of an optimizing C 
compiler on a nontrivial class of programs [154]. 

The main disadvantage of Aquarius in its current state is the time of compilation. This has little 
to do with the sophistication of the optimizations performed, but is due primarily to the naive 
representation of types in the compiler. The representation was chosen for ease of development, 
not speed. It is user-readable and new types can be added easily. 

Among the scalability problems encountered during the development of Aquarius are those 
listed below. 

• Garbage collection with uninitialized variables. Before they are bound, uninitialized 
variables contain unpredictable information. The garbage collector must be able to 
handle this correctly. In Aquarius, the garbage collector follows all pointers, including 
uninitialized variables. Hence, it does not recover all the memory it could. As far as 
we can tell, following all pointers does not adversely affect the system in practice. All 
programs we have tried, including very long-running ones, have stable memory sizes. 

• Interaction of memory management with malloc. The observed behavior was that the 
system crashed because some stdio routines called malloc, which returned memory 
blocks inside the Prolog heap. Calling the default malloc is incompatible with our 
memory manager because it expands memory size if more memory is needed. After 
such an expansion, the malloc-allocated memory is inside a Prolog stack. On some 
platforms there is a routine, f_prealloc, that ensures that stdio routines do all of their
allocation at startup. This does not work for all platforms. Our final solution uses a 
public domain malloc/free package (written by Michael Schroeder) that is given its own
region of memory upon startup. 

• During the DECstation port, a bug was found in the MIPS assembler provided with 
the system. The assembler manual states that registers t0-t9 ($8-$15, $24-$25) are not 
preserved across procedure calls. The MIPS instruction scheduler apparently assumes 
that they need not be saved even across branches, but this is not documented. We solved 
the problem with the directive ".set nobopt", which prevents the scheduler from moving 
an instruction at a branch destination into the delay slot. This results in slightly lower 
performance. The problem went undiscovered until we made the system self-compiling. 

3.2 Hardware Histories 

Starting in the early 1980's there was interest in building hardware architectures optimized for 
Prolog. Two events catalyzed this interest: the start of the Japanese Fifth Generation Project 
in 1982 and the development of the WAM in 1983. In 1984 Tick and Warren proposed a paper 
design of a microcoded WAM that was influential for these developments [142]. At first, the 
specialized architectures were mostly microcoded implementations of the WAM (e.g., the PLM
and the PSI-II). Later architectures (e.g., the KCM and the VLSI-BAM) modified the WAM 
design. 

Some of the most important efforts are the PSI and CHI machine projects primarily at ICOT, 
the KCM project at ECRC, the POPE project at the GMD in Berlin, the Pegasus project at 
Mitsubishi, the Aquarius project at U.C. Berkeley (with its commercial offspring, Xenologic 
Inc.), and the IPP project at Hitachi. All these groups built working systems. 

The POPE (Parallel Operating Prolog Engine) design is based on extracting fine-grain paral- 
lelism in WAM instructions [11]. The POPE was built in Berlin at the GMD (Gesellschaft 
für Mathematik und Datenverarbeitung) in the late 1980's. The machine is a ring of up to
seven tightly coupled sequential Prolog processors. Parallelism is achieved at each call by 
interleaving argument setup with head unification. The head unification is done on the next 
machine in the ring. In this fashion, the machine is automatically load balanced and achieves 
a speedup of up to seven. 

The IPP (Integrated Prolog Processor) [78] is a Hitachi ECL superminicomputer of cycle time 
23 ns (≈ 43.5 MHz) with 3% added hardware support for Prolog. The IPP was built in the late
1980's. The support comprises an increased microcode memory of 2 KW and tag manipulation 
hardware. The IPP implements a microcoded WAM instruction set modified to reduce pipeline 
bubbles and memory references. Its performance is comparable to Aquarius Prolog on a 
SPARCstation 1+ (see Table 7). 

In the late 1980's came the first efforts to build RISC processors for Prolog. These include 
Pegasus, LIBRA [101], and Carmel-2 [56] (the latter supports Flat Concurrent Prolog). For 
lack of appropriate compiler technology, these systems executed macro-expanded WAM code 
or hand-coded assembly code. 
The Pegasus project began in 1986 at Mitsubishi. They designed and fabricated three single-chip
RISC microprocessors in the period 1987-1990 [125]. The first two chips were fabricated
in October 1987 and August 1988 [123, 124, 166]. The first chip contains 80,000 transistors
in an area of 10 mm square (≈ 100 mm²) with 2 µm CMOS. The second chip contains 80,000
transistors in an area of 9.7 mm square (≈ 94 mm²) with 1.5 µm CMOS. The third and last
chip, Pegasus-II, was fabricated in September 1990 and at 10 MHz achieves a performance
comparable to the KCM (see Table 7). The third chip contains 144,000 transistors in an
area of 9.3 mm square (≈ 86.5 mm²) with 1.2 µm CMOS. The last two chips ran the Warren
benchmarks a few months after fabrication. The chips have a bank of shadow registers to 
improve the performance of shallow backtracking. They provide support for tagging and 
dereferencing with ideas similar to those of the VLSI-BAM and KCM. Pegasus-II has two 
interesting features. It provides support for context dependent execution (which the designers 
call "dynamic execution switching") of read/write mode in unification (see Section 3.2.2). It 
provides compound instructions (pop & jump, push & jump, pop & move, push & move) to 
exploit data path parallelism. 

By 1990, the appropriate compiler technology was developed on two RISC machines. The 
VLSI-BAM, a special-purpose processor, ran Aquarius Prolog [63]. The MIPS R3000, a 
general-purpose processor, ran Parma [139]. The VLSI-BAM has a modest amount of archi- 
tectural support for Prolog (10.6% of active chip area). Parma achieves a somewhat greater 
performance on a general-purpose processor at the same clock rate (see Table 7). The major 
difference between the two systems is that Parma has a bigger type domain in its analysis (see 
Figure 10 and Section 2.4.5). 

The experience with Aquarius and Parma proves that there is nothing inherent in the Prolog 
language that prevents it from being implemented with execution speed comparable to that of 
imperative languages. Comparing the two systems shows that improved analysis lessens the 
need for architectural support. 

Since 1990 the main interest in special-purpose architectures has been as experiments to guide 
future general-purpose designs. The interest in building special-purpose architectures for their 
own sake has died down. Better compilation techniques and increasingly faster general-purpose 
machines have taken the wind out of its sails (see also Section 5.1). This parallels the history 
of Lisp machines. 

The rest of this section examines three projects in more detail: the PSI machine project 
(ICOT/Mitsubishi/Oki), the KCM project (ECRC), and the Aquarius project (U.C. Berkeley). 
I have chosen these projects for two reasons: they show clearly how system performance 
improved as Prolog was better understood, and detailed measurements were performed on 
them. 

3.2.1 ICOT and the PSI Machines

The FGCS (Fifth Generation Computer System) project at ICOT (Japanese Institute for New 
Generation Technology) has designed and built a large number of sequential and parallel Prolog 
machines [134, 147]. Both in manpower and machines, the FGCS was the largest architecture 
project in the logic programming community. Two series of sequential machines were built:
the PSI (Personal Sequential Inference) machines (PSI-I, PSI-II, and PSI-III) and the CHI 
(Cooperative High performance sequential Inference) machines (CHI-I and CHI-II) [54]. I 
will limit the discussion to the PSI machines, which were the most popular. All the PSI 
machines are horizontally microprogrammed and have 40-bit data words with 8-bit tag and
32-bit value fields. 

The PSI-I was developed before the WAM [133]. After the development of the WAM this 
was followed by two WAM-based machines, the PSI-II and PSI-III. The three models were 
manufactured by Mitsubishi and Oki, and commercialized by Mitsubishi in Japan. Several mul- 
tiprocessors were built at ICOT with these processors as their sequential processing elements. 
The PSI-II is the PE of the Multi-PSI/v2 and the PSI-III is the PE of the PIM/m. 

The PSI-I was designed as a personal workstation for logic programming. It was first operational 
in December 1983 at a clock rate of 5 MHz. It runs ESP (Extended Sequential Prolog), a Prolog 
extended with object-oriented features. More than 100 machines were shipped. The first ESP 
implementation was an interpreter written in microcode (not a WAM). A WAM emulator was 
later written for the PSI-I and ran twice as fast. The main advantage of the PSI-I was not speed, 
but memory. It had 80 MB of physical memory, a huge amount in its day. 

The PSI-II was first operational in December 1986 [109]. More than 500 PSI-II machines were 
shipped from 1987 until 1990 and delivered primarily to ICOT. Its clock was originally 5 MHz 
but was quickly upgraded to 6.45 MHz. At the higher clock, its average performance is 3 to 4 
times that of the interpreted PSI-I. 

The PSI-III was first operational near the end of 1990. More than 200 PSI-III machines have 
been shipped. It is binary compatible with the PSI-II and has almost the same architecture with 
a clock rate of 15 MHz. The microcode was ported from the PSI-II by an automatic translator. 
Its average performance is 2 to 3 times that of the upgraded PSI-II. 

3.2.2 ECRC and the KCM 

The architecture work at ECRC culminated in the KCM (Knowledge Crunching Machine) 
project, which started in 1987 [14, 112]. The KCM was probably the most sophisticated 
Prolog machine of the late 1980's. It had an innovative architecture and significant compiler 
design was done for it. It was preceded by two years of preliminary studies (the ICM, ICM3, 
and ICM4 architectures) [111, 165]. The KCM was built by Siemens. The first prototypes 
were operational in July 1988 and ran at a clock speed of 12.5 MHz. About 50 machines were 
delivered to ECRC and its member companies [141]. 

The KCM is a single user, single tasking, dedicated coprocessor for Prolog, used as a back-end 
to a Unix workstation. It is a tagged general-purpose design with support for Prolog, and hence 
is not limited to Prolog. It uses 64-bit data words, with a 32-bit tag and a 32-bit value field. 

The KCM's instruction set consists of two parts: a general-purpose RISC part and a microcoded 
WAM-like part. Prolog compilation for the KCM is still WAM-like, but the instructions have 
evolved greatly from Warren's original design (see [92, 112]). The KCM supports the delayed 
creation of choice points. The KCM runs KCM-SEPIA, a large subset of SEPIA that was 
ported to it (see Section 3.1.7). 

Feature                                   Benefit (%) 
multiway tag branch (MWAC)                    23.1 
context dependent execution (flags)           11.4 
dereferencing support                         10.0 
trail support                                  7.2 
load term                                      5.7 
fast choice point creation/restoration         2.3 
Total                                         59.7 

Table 5: The Benefits of Prolog-Specific Features in the KCM 

The Prolog support on the KCM improves its performance by a factor of about 1.60 [112, 141]. 
The architectural features and their effects on performance are given in Table 5. 

The MWAC (Multi-Way Address Calculator) is a functional unit that does a 16-way microcode 
branch depending on the types of two arguments. It calculates the target address during the 
last step of dereferencing. The MWAC is used in the execution of all unification operations. It 
is similar to the partial unification unit of the LIBRA [101]. 

Context dependent execution uses flags in addition to the opcode during instruction decoding. 
Three flags are used: read/write mode for unification, choicepoint/nochoicepoint for delayed 
choice point creation, and deep/shallow for fast fail in shallow backtracking. 
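
The idea of a multiway branch on the types of two arguments can be sketched in software as a table lookup. This is not the KCM's microcode. The four type classes below are an assumption, chosen because four classes per argument give the 16 branch targets mentioned above, and the case names are hypothetical.

```python
from itertools import product

# Assumption: four coarse type classes per argument, so 4 x 4 = 16 targets.
CLASSES = ("ref", "atomic", "list", "struct")

def case_for(t1, t2):
    """Name the unification case for a pair of dereferenced argument types."""
    if t1 == "ref" or t2 == "ref":
        return "bind"            # at least one unbound variable: bind it
    if t1 != t2:
        return "fail"            # different kinds of non-variable terms never unify
    if t1 == "atomic":
        return "compare"         # two constants: compare their values
    return "unify_args"          # two lists or two structures: descend into them

# Precomputed branch table: one lookup replaces a chain of tag tests.
BRANCH_TABLE = {(a, b): case_for(a, b) for a, b in product(CLASSES, repeat=2)}

def dispatch(t1, t2):
    """Select the unification case with a single table lookup."""
    return BRANCH_TABLE[(t1, t2)]

print(len(BRANCH_TABLE), dispatch("ref", "list"), dispatch("list", "struct"))
```

In hardware the lookup is a computed branch address rather than a dictionary access, but the effect is the same: the pair of tags selects the handler directly.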

3.2.3 The Aquarius Project: The PLM and the VLSI-BAM 

In 1983, Alvin Despain and Yale Patt at U.C. Berkeley initiated the Aquarius project. Its 
main goal was to design high performance computer systems with large symbolic and numeric 
components. The project continued at Berkeley until 1991. 15 They decided to focus on 
Prolog architectures, being inspired by the FGCS project and seduced by the mathematical 
simplicity of Prolog. As soon as Warren presented the WAM at Berkeley, Despain turned the 
project to focus on hardware support for the WAM. He proposed that I write a compiler for 
their architecture, the PLM. The compiler was completed and the report was delivered to the 
university on August 22, 1984. 16 This was the first published WAM compiler [148]. 17 

A whole series of sequential and parallel Prolog architecture designs came out of Aquarius. 
The sequential designs that were built are: 



• The PLM [42, 43] (1983-87). The Programmed Logic Machine. 18 This is a microcoded 
WAM. 

• The VLSI-PLM [128, 129] (1985-89). This is a single-chip implementation of the PLM. 

• The Xenologic X-1. This is a commercial version of the PLM, designed as a coprocessor 
for the Sun-3. Due to weaknesses in its system software, this system was not 
commercially successful. 

• The VLSI-BAM [63] (1988-91). The VLSI Berkeley Abstract Machine. This is a 
single-chip RISC processor with extensions for Prolog. 

15 Despain is continuing this work at USC's Advanced Computer Architecture Laboratory. 
16 The exact day of my flight back to Belgium. 
17 In January 1991, I toured several German universities and research institutes to talk about Aquarius Prolog. At 
ECRC a scientist from East Berlin came to me after the talk. He explained that they had typed in the source code 
of the PLM compiler from the appendix of the report. 

The PLM was wire-wrapped and ran a few small programs in 1985. The Xenologic X-1 has 
been running at 10 MHz since 1987. The VLSI-PLM was fabricated and ran all benchmarks 
at 10 MHz in June 1989. The VLSI-BAM was designed to run at 30 MHz. It contains 110,000 
transistors in an active area of 91 mm² with 1.2 µm CMOS. It was fabricated in November 1990 
and ran most benchmarks of [154, 156] at 20 or 25 MHz on its custom cache board in November 
1991. 

The core of the VLSI-BAM is a RISC in the classic sense: it is a 32-bit pipelined load-store 
architecture with single-cycle instructions, 32 registers, and branch delay slots. The processor 
is extended with support for Prolog and for multiprocessing, which together form 10.6% of the 
active chip area and improve Prolog performance by a factor of about 1.70 [63]. The VLSI-BAM 
executes the same Prolog program in one third the cycles of the VLSI-PLM, a gain due to 
improved compilation. 

The primary purpose in building the VLSI-BAM was not to achieve the best absolute performance 
(a university project cannot compete in performance with industry) but to quantify the 
usefulness of its architectural features. The intention was that the results could be used to guide 
the design of other machines. 

The Prolog support takes the form of six architectural features and new instructions using them. 
The architectural features, their performance benefits, and their active chip area are given in 
Table 6. The benefit figures cannot be directly added up because the effects of the architectural 
features are not independent. 

Feature                               Benefit (%)    Area (%) 
last tag logic (tagged branching)         18.9          1.6 
double-word memory port               (illegible)       1.9 
tag and segment mapping                   10.3          4.8 
multi-cycle/conditional                    9.1          0.1 
tagged-immediates                          7.9          2.2 
arithmetic overflow detect                 1.4         <0.1 
Total                                     70.1         10.6 

Table 6: The Benefits and Chip Area of Prolog-Specific Features in the VLSI-BAM 

Except for dereference, the instructions are all single-cycle. There are two- and three-way 
tagged branches to support unification and a conditional push to support trailing. The instructions 
for data structure creation (write-mode unification) were derived automatically using 
constrained exhaustive search [64]. VLSI-BAM measurements [63] show that with advanced 
compilation techniques, multiway branches for general unification are effective only up to a 
three-way branch. 19 Multiple-cycle (primarily dereference) and conditional instructions are 
implemented by logic to insert or remove opcodes in the pipeline. The opcode pipe has space 
for both user instructions and added "internal" instructions. The double-word memory port 
(with double bandwidth to cache) improves general-purpose memory operations as well as 
choice point creation and restoration speed. 

18 The name correspondence with the PLM model in Warren's dissertation [159] is a coincidence. 
19 This does not contradict the measurements of the KCM's MWAC since the latter is used for all unification 
operations, not just general unification. 
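
Two of the operations that this section says are supported in hardware, dereferencing and the conditional push used for trailing, can be sketched in software as follows. This is a minimal illustration, not the VLSI-BAM instruction semantics. The store layout, the names `deref`, `bind`, and `hb`, and the trailing test (the usual WAM-style condition that a variable is trailed only if it is older than the most recent choice point) are assumptions that do not come from the text.

```python
# Sketch of WAM-style dereferencing and conditional trailing.
# Assumption: cells are (tag, value) pairs in a dict keyed by address, and an
# unbound variable is a reference cell that points to itself.
REF, INT = "ref", "int"

def deref(store, addr):
    """Follow a chain of reference cells to its end."""
    while True:
        tag, value = store[addr]
        if tag != REF or value == addr:   # non-reference or unbound variable
            return addr
        addr = value

def bind(store, trail, var_addr, target_addr, hb):
    """Bind an unbound variable, trailing it only when that is needed."""
    if var_addr < hb:                     # the "conditional push"
        trail.append(var_addr)            # remember it for undoing on backtracking
    store[var_addr] = (REF, target_addr)

store = {0: (REF, 1), 1: (REF, 2), 2: (REF, 2), 3: (INT, 7)}
trail = []
top = deref(store, 0)                     # follows 0 -> 1 -> 2 (unbound)
bind(store, trail, top, 3, hb=3)
print(top, trail, store[deref(store, 0)])
```

The loop in `deref` has a data-dependent length, which is why dereference is the one multiple-cycle instruction, while the test in `bind` is a single comparison followed by an optional push.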

4 The Evolution of Performance 

Due to faster machines and improved compilation technology, the performance of Prolog has 
increased about two orders of magnitude since DEC-10 Prolog. Table 7 gives the execution 
time ratios, relative to DEC-10 Prolog, of a set of representative systems running the five Warren 
benchmarks [159]. For the reasons given below, the numbers in Table 7 do not generalize to 
large programs. They should be seen only as indicating trends. 

Table 7 is split into two parts. The first five rows show the performance of specialized hardware. 
The following rows show general-purpose hardware. For the first five rows and for DEC-10 
Prolog the year in which the systems were first running is given. For the other systems the 
architecture is given. Results for the benchmarks nreverse, qsort, deriv, serialise, and query 
are given in columns N, Q, D, S, and R, respectively. Table 8 gives their absolute execution 
times on DEC-10 Prolog. The benchmarks were timed with a failure-driven loop. The deriv 
benchmark is the sum of the four benchmarks times10, log10, divide10, and ops8. The last 
column of Table 7 gives the harmonic mean of the speedup ratios. 
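
As a worked check of how the last column is computed, the sketch below takes the five speedup ratios printed in the Table 7 row for the PLM and computes their harmonic mean. The variable and function names are illustrative only.

```python
from statistics import harmonic_mean

# Speedup ratios (N, Q, D, S, R) as printed in the Table 7 row for the PLM.
plm_speedups = [19, 12, 9, 12, 8]

def mean_speedup(ratios):
    """Harmonic mean of per-benchmark speedup ratios."""
    return harmonic_mean(ratios)

print(round(mean_speedup(plm_speedups), 2))
```

The result rounds to 11, the value in the mean column for that row. The harmonic mean is dominated by the smallest ratios, so one very fast benchmark such as nreverse cannot inflate the summary figure the way it would with an arithmetic mean.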

Performance is one of the few quantifiable measures of a system. Many other measures are 
just as important, but are hard to quantify. For example, it is difficult to assign numbers to 
embeddability, robustness, debuggability, portability, and the usefulness of the available built- 
in operations. The overall quality of a system depends on how well it meets the needs of the 
task at hand. A rough indication of overall quality can be obtained from the software sagas 
presented earlier. This should be refined for a particular system by using it to solve a relevant 
problem. 

The systems marked by † are research systems. The systems on general-purpose hardware that 
are marked by ‡ are native code systems. The others are emulated. The numbers for XSB 1.3 
are within 10% of SB-Prolog 3.1. Many of the systems generate better code if the program has 
mode declarations. For example, IBM Prolog is about 1.5 times faster with mode declarations. 
MProlog 2.3 is about 1.2 times faster with mode and indexing declarations. On the same PC 
clone, emulated SICStus 2.1 is 1.5 times slower than MProlog 2.3 and five times slower than 
native SICStus 2.1 on a SPARCstation 1+. 

System                      Machine (Year)                 Clock (MHz)     N     Q     D     S     R   mean 
† PLM compiler [148]        PLM [43] (1985)                         10    19    12     9    12     8     11 
ESP                         PSI-II (1986)                         6.45    41    25    12    18    10     16 
KCM-SEPIA [112]             KCM (1989)                            12.5    83    57    37    33    15     32 
† Pegasus compiler [125]    Pegasus-II (1990)                       10    91    69    39    40    19     39 
† Aquarius [63]             VLSI-BAM (1991)                         20   270   260    75    57    32     72 

(system, machine, and clock missing in source)                            37    16    10    10     5     10 
Quintus 2.5 [154]           SPARCstation 1+ (SPARC)                 25    33    16     9    13     8     12 
‡ BIM 3.1 beta [154]        SPARCstation 1+ (SPARC)                 25    34    21     8    16     8     13 
‡ SICStus 2.1               SPARCstation 1+ (SPARC)                 25    39    26    15    20     8     17 
‡† Aquarius [154]           SPARCstation 1+ (SPARC)                 25   120   140    28    25    12     29 
‡ IBM Prolog                ES/9000 Model 9021 (370)                 -   120    59    74    69    33     60 
‡† Aquarius 1.0             DECstation 5000/200 (R3000)             25   180   210    63    44    46     71 
‡† Parma [140]              MIPS R3230 (R3000)                      25   330   350   130   170    59    140 

Table 7: The Evolution of Prolog Performance 

The Warren benchmarks were chosen because reliable performance numbers for them are 
available for many machines. They are not a good measure of the performance of real programs. 
A more realistic benchmark set that subsumes the Warren benchmarks is used in [140, 154] 
and may be obtained from [156]. 

The Warren benchmarks are small and many systems have been optimized to execute them 
fast. The speedup for nreverse is greater than average because more effort has gone into 
optimizing it. The speedup for query is less than average because it is dominated by integer 
multiplication and division. Due to limitations in their analysis domains (see Section 2.4.5), 
Aquarius and Parma have lower performance for large programs unless the programs are tuned. 
Large programs are more likely to spend most of their time doing built-in operations, which 
are a fixed cost since they are usually implemented in a lower level language. 

In older publications, a common unit in Prolog performance is the LIPS, or Logical Inferences 
Per Second, i.e., the number of goal invocations or procedure calls per second. Because the 
amount of work done by a procedure call is not constant, the LIPS number is an unreliable 
indicator of system performance and is not given. By convention, published LIPS numbers are 
measured for nreverse, which reverses a 30-element list in 496 logical inferences. 
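
The figure of 496 logical inferences can be reproduced by counting procedure calls in naive reverse. The sketch below mirrors the usual two-predicate definition (reverse the tail, then append the head) in Python. The function names and the helper `lips` are illustrative, and the timing in the usage note is a made-up figure, not a measurement.

```python
# Count logical inferences (procedure calls) made by naive reverse.
def append(xs, ys, calls):
    calls[0] += 1                      # one inference per call to append
    if not xs:
        return ys
    return [xs[0]] + append(xs[1:], ys, calls)

def nreverse(xs, calls):
    calls[0] += 1                      # one inference per call to nreverse
    if not xs:
        return []
    return append(nreverse(xs[1:], calls), [xs[0]], calls)

def lips(inferences, seconds):
    """Logical inferences per second for one timed run."""
    return inferences / seconds

calls = [0]
reversed_list = nreverse(list(range(30)), calls)
print(calls[0])                        # 31 nreverse calls + 465 append calls
```

For a 30-element list there are 31 calls to nreverse and 1 + 2 + ... + 30 = 465 calls to append, which gives 496. If a system hypothetically reversed the list in 1 ms, `lips(496, 0.001)` would report about 496,000 LIPS.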
Benchmark        N       Q       D       S       R 
Time (ms)      53.7    75.0    10.1    40.2   185.0 

Table 8: The Execution Times of the Warren Benchmarks on DEC-10 Prolog 

It is difficult to compare the performance of two systems unless they are running on identical 
hardware. For example, the same system can vary greatly in speed even when running the 
same CPU-bound program on two machines with the same processor, clock speed, and cache 
size. This could be the case because the write buffers are of different sizes. Among the 
machine-related factors that affect performance are clock speed, but also the memory system 
(i.e., cache and virtual memory structure, memory size and bandwidth), the operating system 
(e.g., speed of I/O and context switching overhead), the data path (e.g., pipeline structure, 
multiple functional units, out-of-order and superscalar execution), and the implementation of 
various primitive operations (e.g., multiplication can vary an order of magnitude in speed even 
on systems with the same clock). An important difference between the SPARC-based and 
R3000-based systems in Table 7 is that the latter have a faster memory system. 

5 Future Paths in Logic Programming Implementation 

This section gives a personal view of the trends in sequential logic programming implementa- 
tion. It is important to distinguish three levels of evolution. First, the low level trends. What 
will be the basic improvements in implementation technology for Prolog and related languages? 
Second, the high level trends. What will be the new tools, new languages, and programming 
paradigms? Finally, what will be the relation between Prolog and the mainstream computing 
community? See [48] for an early but still useful discussion of these issues. 

5.1 Low Level Trends 

There are many ways in which Prolog implementation technology can be improved. Here are 
some of the important ones, given in order of increasing difficulty: 

• Overlap with mainstream compiler technology. As Prolog compilers approach imper- 
ative language performance, the standard optimizations of imperative language compilers 
(global register allocation, code motion, instruction reordering, and so forth) become im- 
portant. Some of these are being implemented in current systems [38]. One approach 
is to compile to C. This shortens development time, gains portability, and (to a lesser 
degree) takes advantage of what the C compiler does (e.g., register allocation). This 
approach has traditionally had a performance loss over native code of a factor of two to 
three. This will change in the future. For example, because of its first-class labels and 
global register declarations, the recently released GNU C 2.X compiler has a smaller 
performance loss than other C compilers [36, 57]. Recent work shows that the over- 
head of compilation to C can be reduced to less than 30%, while keeping the system 
portable [99]. C is becoming a portable assembly language. 
• Type inference and operational types. When writing a program, a programmer often 
has definite intentions about the types of predicate arguments. This includes information 
on the structure of compound terms (e.g., recursive types such as lists and trees) and on 
operational types (see Section 2.4.5). For analysis to work well with large programs as 
well as small benchmarks, the analysis domain has to represent this information, to track 
variable dependencies, and to correctly handle built-in predicates. Objects whose type 
is known at compile-time can be represented unboxed, i.e., accessible without tagging or 
other overhead. Current systems only unbox variables (see the discussion on uninitialized 
variables in Section 2.4.4) and numbers within arithmetic expressions. 

• Determinism extraction. Often, a deterministic user-defined predicate is used to select 
a clause. This is currently compiled by creating a choice point, executing the predicate, 
and backtracking if it fails. It would be more efficient to compile such a predicate as a 
boolean function and to do a conditional jump on its result. 

• Multiple specialization. Different calls to the same predicate frequently have different 
types in the same argument. The predicate will run faster if it is compiled separately 
for each pattern of calling types. As a first step, multiple specialization can be enabled 
by a directive. Profiling could supply the directives. Measurements show that adding 
these directives is often fruitful. For example, in the chat_parser benchmark the inner 
loop is a two-clause predicate, terminal/5, that is called 22 times. Making 22 copies 
and recompiling with analysis under Aquarius Prolog results in a 16% performance 
improvement. In programs with tighter inner loops the performance improvement can 
be much greater. For example, the SEND+MORE=MONEY puzzle shows a tenfold 
speedup [155]. 

• Compile-time garbage collection. Prolog creates three kinds of data objects in memory: 
choice points, compound data terms, and environments. When a data object becomes 
inaccessible, a new object can often reuse part of the old one. For example, a program 
that uses an array can destructively update the array if it is unaliased (see Section 2.4.4). 
Unaliased arrays are called single-threaded. Recent developments indicate that it is more 
practical to enforce single-threadedness syntactically (through source transformation) 
than to use an analyzer-compiler combination [65]. See for example the use of monads 
in functional programming [158] and the Extended Definite Clause Grammar notation 
of [151, 153] which is extended in [7]. 

• Dynamic to static conversion. All data in Prolog is allocated dynamically, i.e., at run- 
time. It is accessed through tagged pointers. Often, it is necessary to follow a chain of 
pointers to find the data. Since CPU speed is increasing faster than memory speed [59], 
the overhead of memory access will become relatively more important in the future. The 
software and hardware approaches to speed up memory access are complementary: 

- A future compiler could statically allocate part of the dynamically allocated data 
to reduce access time and improve locality. This requires analysis to determine 
the evolution of aliasing during program execution. For example, objects that are 
unaliased, that exist only in one copy at any given time, and whose size is known 
can be allocated statically. 

- A future architecture could be designed to tolerate memory latency. If the architec- 
ture could follow one level of tagged pointer in zero time, then the execution model 
of Prolog could be changed drastically and would run faster. Two techniques that 
help are starting to appear in existing architectures: asynchronous loads (decou- 
pling the load request and arrival of the result) and multithreading (fast switching 
between register sets). These are useful for all languages, not just Prolog. 
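
The determinism extraction item in the list above can be illustrated by contrasting the two compilation schemes on a small example. This is a sketch only: the predicate (a maximum of two numbers selected by a deterministic test `less`), the tiny clause-trying loop, and the counters are all hypothetical and are not taken from any of the systems discussed.

```python
# Sketch contrasting two ways to run a clause selected by a deterministic test.
def less(x, y):
    """Deterministic user-defined test, viewed as a boolean function."""
    return x < y

def max_with_choice_point(x, y, stats):
    """Naive scheme: try each clause in turn, backtracking when its test fails."""
    clauses = [
        (lambda: less(x, y), lambda: y),        # max(X,Y,Y) :- less(X,Y).
        (lambda: not less(x, y), lambda: x),    # max(X,Y,X) :- \+ less(X,Y).
    ]
    stats["choice_points"] += 1                 # state saved before the first clause
    for test, result in clauses:
        if test():
            return result()
        stats["backtracks"] += 1                # restore state, try the next clause
    raise ValueError("no clause matched")

def max_with_branch(x, y):
    """Determinism extraction: evaluate the test once and jump on its result."""
    return y if less(x, y) else x

stats = {"choice_points": 0, "backtracks": 0}
print(max_with_choice_point(5, 3, stats), max_with_branch(5, 3), stats)
```

Both versions compute the same answer. The second avoids saving and restoring machine state and evaluates the test only once, which is the saving the item describes.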

5.2 High Level Trends 

The development of both Prolog and more advanced logic languages is an active area of research. 
In recent years, the implementation of logic programming systems has continued in two main 
directions. 

• Further development of Prolog. 

- Software engineering aspects: this development has been mostly in the area of 
extended usability of the system rather than performance. For example, many 
systems including Quintus, SICStus, BIM, and ECLiPSe, have a foreign language 
interface that allows arbitrary calls between Prolog and C, to any level of nesting. 
Debugging has improved, and several systems now have source-level debuggers 
and profilers [51]. Many systems have eased the strict control flow by including 
coroutining facilities (such as freeze). There is an ISO standard for Prolog that is 
essentially complete [122]. 

- "Cleaner" Prologs: these languages aim to keep the ideas and functionality of 
Prolog, but to replace the "dirty" operational features (such as assert, var, and 
cut) by clean declarative ones. It is not yet obvious whether this is possible 
without losing expressivity and performance. This group includes the MU-Prolog 
and NU-Prolog family [104] (see Section 3.1.3), xpProlog [83], and the Gödel 
language [62]. 

• Other logic programming languages. These can be roughly subdivided into three main 
families. The families overlap, but the division is still useful. 

- Concurrent languages: these languages include the committed-choice languages 
[126] (e.g., Parlog, FGHC, and FCP) and languages based on the "Andorra principle" 
[33, 55] (an elegant synthesis of Prolog and committed-choice languages). 

- Constraint languages: a language that does incremental global constraint solving 
in a particular domain is called a constraint language. These languages come 
in two flavors. The general-purpose languages (such as Prolog, Trilogy [157], 
and LIFE [5]) provide domains that are useful for most programming tasks. For 
example, unification in Prolog handles equality constraints over finite trees. The 
special-purpose languages (such as Prolog III, CLP(R), and CHIP) provide specialized 
domains that are useful for particular kinds of problems. Examples are 
linear arithmetic inequalities on real numbers and membership in finite domains. 
These languages allow practical solutions to many problems previously considered 
intractable such as optimization problems with large search spaces. 

- "Synthesis" languages: there are now serious attempts to make syntheses of different 
styles of programming [53]. For example, λProlog [100] and languages 
based on narrowing are syntheses of logic and functional programming, LIFE is a 
synthesis of logic, functional, and object-oriented programming, and AKL [55] is 
a synthesis of concurrent and constraint languages [121]. An important principle 
is that a synthesis must start from a simple theoretical foundation. 
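
The remark in the list above that unification in Prolog handles equality constraints over finite trees can be made concrete with a small solver. This is a minimal sketch under stated assumptions: the term representation (tuples for compound terms, a `Var` class for variables) is hypothetical, and the occurs check is omitted for brevity, so the sketch does not guard against cyclic bindings.

```python
# Sketch of unification as incremental solving of equality constraints on trees.
# Assumption: compound terms are tuples (functor, arg1, ...), variables are Var
# objects, and anything else is a constant.
class Var:
    def __init__(self, name):
        self.name = name

def walk(term, subst):
    """Replace a bound variable by its binding, repeatedly."""
    while isinstance(term, Var) and term in subst:
        term = subst[term]
    return term

def unify(a, b, subst):
    """Return a substitution extending subst that solves a = b, or None."""
    a, b = walk(a, subst), walk(b, subst)
    if a is b:
        return subst
    if isinstance(a, Var):
        return {**subst, a: b}
    if isinstance(b, Var):
        return {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple):
        if len(a) != len(b) or a[0] != b[0]:
            return None                     # different functor or arity
        for x, y in zip(a[1:], b[1:]):
            subst = unify(x, y, subst)      # each argument adds constraints
            if subst is None:
                return None
        return subst
    return subst if a == b else None        # constants must be equal

X, Y = Var("X"), Var("Y")
solution = unify(("f", X, "b"), ("f", "a", Y), {})
print(walk(X, solution), walk(Y, solution))
```

Each call either extends the set of bindings or reports failure, which is what makes unification an incremental constraint solver in the sense used here.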

5.3 Prolog and the Mainstream 

As measured by the number of users, commercial systems, and practical applications, Prolog 
is by far the most successful logic programming language. Its closest competitors are surely 
the special-purpose constraint languages. But it is true that logic programming in particular 
and declarative programming in general remain outside of the mainstream of computing. Two 
important factors that hinder the widespread acceptance of Prolog are: 

• Compatibility. Existing code works and investment in it is large, so people will 
not easily abandon it for new technology. A crucial condition for acceptance therefore 
is that Prolog systems be embeddable. This problem has been solved to varying degrees 
by commercial vendors (see Section 3.1.4). 

• Public perception. To the commercial computing community, the terms "Prolog" and 
"logic programming" are at best perceived as useful in an academic or research setting, 
but not useful for industry. This image is not based on any rational deduction. Changing 
the image requires both marketing and application development. 

The ideas of logic programming will continue to be used in those application domains for 
which they are particularly suited. This includes domains in which program complexity is beyond 
what can be managed in the imperative paradigm. 

6 Summary and Conclusions 

This survey summarizes the technical developments in sequential Prolog implementation during 
the past decade and the systems that pioneered them. Much has happened in this time, and 
I hope that the survey is successful in capturing most of the important developments and in 
pointing out some intriguing trends for the future. 

The WAM opened the floodgates for a proliferation of systems and ideas. It was the substrate 
upon which most sequential Prolog development took place in the past decade. Nowadays, the 
WAM is no longer the best model to use for high performance. But it continues to be useful 
as a conceptual model, and the compilation principle that underlies the WAM is still highly 
relevant: to compile a logic language, simplify each occurrence of one of its basic operations 
with all the information at one's disposal. The last decade has seen an increased understanding 
of how this can be done: by measuring actual programs to optimize frequent operations, by 
learning how to compile unification and backtracking, and by using simpler instruction sets 
and global analysis. 

The Prolog language has proven to be an elegant implementation target. The language has been 
generalized in many ways. There have been large advances in implementation technology, but 
there is still plenty to do, both in implementing Prolog and its successors. The next decade 
promises to be as interesting as the first. 